
Proceedings of the Federated Conference on Computer Science and Information Systems, pp. 179–183. DOI: 10.15439/2020F20. ISSN 2300-5963, ACSIS, Vol. 21.

Overview of the Transformer-based Models for NLP Tasks
Anthony Gillioz (University of Neuchâtel, Neuchâtel, Switzerland; anthony.gillioz@unine.ch)
Jacky Casas, Elena Mugellini, Omar Abou Khaled (University of Applied Sciences and Arts Western Switzerland, Fribourg, Switzerland; {firstname.lastname}@hes-so.ch)

Abstract—In 2017, Vaswani et al. proposed a new neural network architecture named the Transformer. This architecture quickly revolutionized the natural language processing world. Models like GPT and BERT, which rely on the Transformer architecture, have fully outperformed the previous state-of-the-art networks. They surpassed the earlier approaches by such a wide margin that all the recent cutting-edge models seem to rely on Transformer-based architectures.
In this paper, we provide an overview and explanations of the latest models. We cover the auto-regressive models such as GPT, GPT-2 and XLNet, as well as auto-encoder architectures such as BERT and many post-BERT models like RoBERTa, ALBERT and ERNIE 1.0/2.0.

I. INTRODUCTION

The understanding and the treatment of ubiquitous textual data is a major research challenge. The amount of data produced by our society through social media and companies has exploded over the past years, and most of this information is stored in textual format. The human brain can extract meaning from text effortlessly, but this is not the case for a computer. Reliable and well-performing techniques are therefore required to treat this data.
The Natural Language Processing (NLP) domain aims to provide a set of techniques able to address a wide variety of natural language tasks such as Automatic Translation [1], Text Summarization [2] and Text Generation [3]. All these tasks have the meaning-extraction process in common in order to be successful. Undoubtedly, if a technique were able to understand the underlying semantics of texts, it would help to solve the majority of modern NLP problems.
A big concern that restricts a general NLP solver is the single-task training scheme. Gathering data and crafting a specific model to solve a precise problem works well, but it forces us to come up with a new solution not only each time a new issue arises but also each time the model is applied to another domain. A general multi-task solver may be preferable to avoid this time-consuming step.
Recurrent Neural Networks (RNN) were massively used to solve NLP problems and have been popular for several years in supervised NLP models for classification and regression. The success of RNNs is due to the Long Short-Term Memory (LSTM) [4] and Gated Recurrent Unit (GRU) [5] architectures. Those two units prevent the vanishing gradient issue by providing a more direct path for the backpropagation of the gradient, which helps the computation when sentences are long. The high versatility of those networks allows them to solve a wide variety of problems [6]. Unfortunately, those models are not perfect: their inherent recurrent structure makes them hard to parallelize on multiple processes, and the treatment of very long clauses also remains problematic because of the vanishing gradient. To counter those two limiting constraints, [7] introduced a new model architecture: the Transformer. The proposed technique gets rid of the recurrent architecture and relies solely on attention mechanisms. Furthermore, it suffers from neither the vanishing gradient nor the hard parallelization issue, which facilitates and accelerates the training of larger networks.
This work aims to provide a survey and an explanation of the latest Transformer-based models.

II. BACKGROUND

In this section, we introduce a general NLP background. It gives a broad insight into unsupervised pre-training and the pre-Transformer NLP state of the art.

A. Unsupervised Pre-training

Unsupervised pre-training is a particular case of semi-supervised learning that is massively used to train Transformer models. The principle works in two steps. The first one is the pre-training phase: a general representation is computed from raw data in an unsupervised fashion. Second, once this representation is computed, it can be adapted to a downstream task via fine-tuning techniques.
The principal challenge is to find an unsupervised objective function that generates a good representation. There is no consensus on which task provides the most efficient textual description: [8] propose a language modelling task, [9] introduce a masked language modelling objective, and [10] use multi-task language modelling.

B. Context-free representation

The recent significant increase in the performance of NLP models is due to the use of word embeddings, which consist of representing each word as a unique vector. Terms with similar meanings are located close to each other. Word2Vec [11] and GloVe [12] are the most frequently used word embedding methods. They process a large corpus of text and

produce a unique word representation in a high-dimensional space.
Byte Pair Encoding (BPE) [13] is another word embedding technique that uses subword units, halfway between character-level and word-level representations. [14] changed the implementation of BPE to be based on bytes instead of Unicode characters; with this change, they could reduce the vocabulary size from more than 100K to approximately 50K tokens. This has the advantage of not introducing [UNK] (unknown) symbols. Besides that, it does not involve any heuristic preprocessing of the input vocabulary. BPE is used when the corpus to treat is too large and a more efficient technique than Word2Vec or GloVe is required.
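To make the subword idea concrete, here is a minimal sketch of the BPE merge loop in Python. It follows the simplified algorithm described in [13]; the toy corpus, the number of merges and the helper names are illustrative assumptions, and real implementations add word-boundary handling that the plain string replacement below omits.

from collections import Counter

def get_pair_counts(vocab):
    # vocab maps a space-separated symbol sequence (a word) to its corpus frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the adjacent pair by the merged subword
    merged = "".join(pair)
    old = " ".join(pair)
    return {word.replace(old, merged): freq for word, freq in vocab.items()}

# toy corpus: words split into characters, with an end-of-word marker </w>
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                     # the number of merges is a hyper-parameter
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)    # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)

Each printed pair becomes a new token of the learned subword vocabulary; frequent words end up as single units while rare words stay decomposed into smaller pieces.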
C. Attention Layer

Primarily proposed by [5], the attention mechanism aims to catch the long-term dependencies within sentences. The relationships between entities in a phrase are hard to spot, and a strong understanding of the underlying structure of sentences is necessary. Indeed, if we had a method that could tell us how the units of a sentence are correlated within a phrase, language understanding tasks would be more straightforward.
The attention mechanism computes a relation mask between the words of a sentence and uses this mask in an encoder-decoder architecture to detect which words are related to each other. Using this process, NLP tasks such as automatic translation become more flexible because they have access to the dependencies of the sentence, which is a genuine advantage in a translation context. Another notable benefit of the attention mechanism is that the model's behaviour is straightforward for a human to visualize.

III. DATASET

The dominant strategy in the creation of deep learning systems is to gather a corpus corresponding to a given problem, label this data, and build a network that is supposedly able to explain it. This method is not suitable if we want to create a more comprehensive system, i.e. a system that can solve multiple problems without a significant architecture change.
It is therefore essential to learn on heterogeneous data to create general NLP models. If we want systems that can resolve several tasks at the same time, it is necessary to train the model on a wide variety of subjects. Fortunately, in our ubiquitous data world, a large amount of raw text is available online (e.g. Wikipedia, web blogs, Reddit).
Table I shows the most commonly used datasets with their size and the number of tokens they contain. The tokenization is done with SentencePiece [15]. In a few cases, for example in [16], the authors only used a subset of those datasets (e.g. Stories [17] is a subset of the CommonCrawl dataset).

TABLE I
DATASETS COMMONLY USED WITH TRANSFORMER-BASED MODELS
(†: tokenization done with SentencePiece; ‡: uncompressed data)

Dataset                                  Size      Number of tokens†
BookCorpus [18] plus English Wikipedia   13 GB     3.87B
Giga5 [19]                               16 GB     4.75B
ClueWeb09 [20]                           19 GB     4.3B
OpenWebText [21]                         38 GB     -
RealNews [22]                            120 GB‡   -

IV. BENCHMARKS

For a long time, deep learning models were trained to resolve one problem at a time; when those models were then used in another domain, they struggled to generalize correctly. This observation motivated the creation of GLUE, SQuAD V1.1/V2.0 and RACE: benchmarks designed to check the reliability of models on a variety of tasks.
GLUE: The General Language Understanding Evaluation (GLUE) [23] is a collection of nine tasks created to test the generalization of modern NLP models. It covers a wide range of NLP problems like Sentiment Analysis, Question Answering and inference tasks. Because of the rapid improvement of the state of the art on GLUE, SuperGLUE [24] was proposed as a new benchmark to check general language systems with more complicated, more laborious tasks.
SQuAD: The Stanford Question Answering Dataset (SQuAD) V1.1 [25] is a benchmark designed for Reading Comprehension (RC) challenges. It contains more than 100,000 questions. Unlike other RC datasets, there are no proposed answers: each task contains a document, and the model has to find the answer directly in the text passage. SQuAD V2.0 [26] is based on the same principle as V1.1, but this time the answer is not necessarily present in the passage.
RACE: Reading Comprehension from Examinations (RACE) [27] is a collection of English questions set for Chinese students from middle school up to high school. Each item is divided into two parts: a passage that the student must read and a set of 4 potential answers. Considering that the questions are intended for teenagers, keen reasoning skills are required to answer most of the problems correctly. The reasoning subjects present in RACE cover almost all human knowledge.

V. TRANSFORMERS

RNNs (LSTM, GRU) have a recurrent underlying structure, and this fundamental property makes the learning process hard to parallelize. To overcome this issue, [7] proposed a new architecture based solely on attention layers: the Transformer. It has the advantage of catching the long-range dependencies of a sentence while being parallelizable.

A. Transformer architecture

The Transformer is based on an encoder-decoder structure: it takes a sequence X = (x_1, ..., x_N) and produces a latent representation Z = (z_1, ..., z_N). Due to the auto-regressive property of this model, the output sequence Y_M = (y_1, ..., y_M) is produced one element at a time, i.e. the

word y_M is generated using the latent representation Z and the previously generated sequence Y_{M-1} = (y_1, ..., y_{M-1}). The encoder and the decoder use the same Multi-Head Attention layer. A single attention layer maps a query Q and keys K to a weighted sum of the values V; for technical reasons, there is a scaling factor 1/\sqrt{d_k}:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
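As an illustration of this formula, the following NumPy sketch computes scaled dot-product attention for a single head; the toy dimensions and the random inputs are assumptions made only for the example, not values taken from [7].

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # relation "mask" between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights                                # weighted sum of values, attention map

# toy example: 3 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4)); K = rng.normal(size=(3, 4)); V = rng.normal(size=(3, 4))
output, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))   # each row sums to 1: how much each token attends to the others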
B. Auto-Regressive Models

Auto-regressive models take the previous outputs to produce the next outcome. They have the particularity of being unidirectional networks: they can only reach the left context of the evaluated token. However, despite this flaw, they can learn accurate sentence representations. They rely on the regular Language Modeling (LM) task as an unsupervised pre-training objective:

L(X) = \sum_{i} \log P(x_i \mid x_{i-k}, \ldots, x_{i-1}; \Theta)

This LM function maximizes the likelihood of the conditional probability P, where X is the input sequence, k is the context window, and Θ are the parameters of the neural network.
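As a minimal illustration of this objective, the snippet below evaluates L(X) with a context window k = 1 using empirical bigram counts; the toy corpus and the counting "model" are assumptions made only for the example, not part of any cited system.

import math
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

# empirical bigram model: P(x_i | x_{i-1}) estimated by counting (context window k = 1)
unigrams = Counter(corpus[:-1])
bigrams = Counter(zip(corpus[:-1], corpus[1:]))
def p(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

# L(X) = sum_i log P(x_i | x_{i-1}; Theta) over a held-out sequence
X = "the cat sat".split()
L = sum(math.log(p(X[i], X[i - 1])) for i in range(1, len(X)))
print(L)   # closer to 0 means the sequence is more likely under the model

Training a neural language model amounts to adjusting Θ so that this sum is as large as possible over the whole corpus.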
Various models use this property coupled with the Transformer architecture to produce accurate language models (i.e. models that determine the statistical distribution of the learned texts). The first auto-regressive model using the Transformer architecture is GPT [8]. It has a pre-training language modeling phase where it learns on raw texts; in the second learning phase, it uses supervised fine-tuning to adjust the network to the downstream tasks.
GPT-2 [14] uses the same pre-training principles as GPT. This time, however, it tries to achieve the same results in a zero-shot fashion (i.e. without fine-tuning the network on the downstream tasks). To accomplish that goal, it must capture the full complexity of textual data, and it therefore needs a wider system with more parameters. The results of this model are competitive with some supervised approaches on a few subjects (e.g. reading comprehension) but are far from being usable on other jobs such as summarization.
Another auto-regressive network is XLNet [28]. It aims to combine the strength of auto-regressive language modeling with the bidirectionality of BERT [9]. To do so, it relies on Transformer-XL [29], the state-of-the-art model among auto-regressive networks.
C. BERT

GPT and GPT-2 use a unidirectional language model; they can only reach the left context of the evaluated token. That property can harm the overall performance of those models on reasoning or question answering tasks, because in those settings both sides of the sentence are crucial to getting an optimal sentence-level understanding.
To counter this unidirectional constraint, [9] introduced the Bidirectional Encoder Representations from Transformers (BERT). This model can fuse the left and the right context of a sentence, providing a bidirectional representation and allowing better context extraction for reasoning tasks. The architecture of BERT is based on the Multi-Head Attention encoder layers proposed in [7]. Originally, [9] proposed two versions of BERT: the base version with 110M parameters and the large version with 340M parameters.
Like GPT and GPT-2, BERT has an unsupervised pre-training phase where it learns its language representation. Nevertheless, due to its inherent bidirectional architecture, it cannot be trained using the standard Language Model objective: the bidirectionality of BERT allows each word to see itself, so the model could trivially predict the next token. To overcome this issue and pre-train their model, [9] use two unsupervised objective tasks: the Masked Language Model (MLM) and Next Sentence Prediction (NSP).
Once the pre-training phase is over, it remains to fine-tune the model for the downstream tasks. Thanks to BERT's Transformer architecture, this can be done straightforwardly because the same structure is used for the pre-training and the fine-tuning: it merely needs to change the final layer to match the requirements of the downstream task.
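To illustrate the MLM objective, the sketch below corrupts an input sequence along the lines described in [9] (15% of the tokens are selected; of those, 80% become [MASK], 10% a random token, 10% stay unchanged), and the model must recover the original tokens at the selected positions. The vocabulary and tokenization here are toy assumptions.

import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "[MASK]"]

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    # returns the corrupted input and the prediction targets (None = position not predicted)
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)                      # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            targets.append(None)
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
print(mask_for_mlm(tokens, mask_prob=0.3))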

VI. POST-BERT

Due to the high performance of BERT on 11 NLP tasks, many researchers inspired by BERT's architecture applied it and tweaked it to their needs [30], [31].

A. BERT improvement

Further studies have been done to improve the pre-training phase of BERT. The post-BERT model RoBERTa [16] proposes three simple modifications of the training procedure. (I) Based on their empirical results, [16] show that BERT is undertrained; to alleviate this problem, they propose to lengthen the pre-training phase, since learning longer makes the outcomes more accurate. (II) As the results of [32] and [14] demonstrate, end-task accuracy relies on a wide variety of training data; therefore, BERT must be trained on larger datasets. (III) In order to improve the optimization of the model, they propose to increase the batch size. A bigger batch size has two advantages: large batches are easier to parallelize, and they improve the perplexity of the MLM objective.

B. Model reduction

Since the Transformer's revolution, state-of-the-art networks have become bigger and bigger: to obtain a better language representation and better end-task results, the models must grow to catch the high complexity of texts. This expansion of the networks' size has a high computational cost, and more powerful GPUs and TPUs are required to train those large models. If we take, for example, NVIDIA's GPT-8B (https://nv-adlr.github.io/MegatronLM) with 8 billion parameters, it has become infeasible for small tech companies or small labs to train a network that huge. It is therefore necessary to find smaller systems that maintain the high performance of the bigger ones.
Working with smaller models has multiple advantages. If the model size is shrunk, it trains faster, and the inference time is also reduced. If it is small enough, it can be run on smartphones or IoT devices in real time.
One technique introduced to reduce the size of those big networks is knowledge distillation. It is a compression method in which a small network (the student) is trained to reproduce the behaviour of a bigger version of itself (the teacher). The teacher is first trained as a regular network and is then distilled to reduce its size. DistilBERT [33] is a distilled version of BERT that reduces the number of layers by a factor of 2. It retains 97% of BERT's performance on the GLUE benchmark while being 40% smaller and 60% faster at inference time.
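As a rough sketch of the distillation idea, the student below is trained against a weighted sum of a soft loss (matching the teacher's temperature-softened output distribution) and the usual hard-label loss. This is the generic knowledge-distillation objective, not the exact DistilBERT training loss; the temperature, weighting and toy logits are illustrative assumptions.

import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft part: cross-entropy between the softened teacher and student distributions
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -np.sum(p_teacher * np.log(p_student), axis=-1).mean()
    # hard part: usual cross-entropy against the ground-truth labels
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels]).mean()
    return alpha * soft + (1 - alpha) * hard

# toy batch: 2 examples, 3 classes
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
student = np.array([[1.0, 0.2, -0.5], [0.2, 0.9, 0.1]])
labels = np.array([0, 1])
print(distillation_loss(student, teacher, labels))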
Another way to reduce the size of BERT is by changing the architecture itself. ALBERT [34] proposes two ideas to decrease the number of parameters. The first approach factorizes the embedding parameters: the large vocabulary embedding matrix is split into two smaller matrices, so that the size of the hidden layer is decoupled from the size of the vocabulary representation. The second method is cross-layer parameter sharing, which prevents the number of parameters from growing with the depth of the network. Together, those two tricks reduce the number of parameters of the large BERT version by a factor of about 18 without a loss of performance. Since the architecture is smaller, the training is also faster.
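A back-of-the-envelope calculation shows why the embedding factorization helps. The vocabulary size V = 30,000, hidden size H = 1024 and reduced embedding size E = 128 below are the orders of magnitude reported in [34]; the snippet is only illustrative arithmetic.

V, H, E = 30_000, 1024, 128    # vocabulary size, hidden size, reduced embedding size

direct = V * H                 # one V x H embedding matrix, tied to the hidden size
factorized = V * E + E * H     # V x E lookup followed by an E x H projection

print(f"direct embedding:     {direct:,} parameters")      # 30,720,000
print(f"factorized embedding: {factorized:,} parameters")  # 3,971,072
print(f"reduction factor:     {direct / factorized:.1f}x")

The embedding block alone shrinks by roughly an order of magnitude, and the saving grows when the hidden size is increased while E stays fixed.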
C. Multitask Learning

BERT learns several tasks sequentially and increases the overall performance of the downstream end-tasks. The main issue with this continual pre-training method is that the model must learn newly introduced sub-tasks efficiently and quickly while remembering what it has learned previously. The Multi-Task Learning (MTL) principle is based on a human consideration: if you learn how to do a first task, a second, related task will be easier to master. There are two main trends in MTL.
The first one uses an MTL scheme during the fine-tuning phase. MT-DNN [35], based on the backbone of BERT, uses the same pre-training procedure, but during the fine-tuning step it trains on four types of tasks. Training on all the GLUE tasks at the same time gives it an efficient generalization ability.
On the opposite side, [10] propose an MTL process directly during the pre-training step: ERNIE 2.0 introduces a continual pre-training framework. More specifically, it uses Sequential Multi-task Learning, where it begins by learning a first task; when this first task is mastered, a new task is introduced in the continual learning process. The previously optimized parameters are used to initialize the model, and the new task and the previous tasks are trained concurrently. There are three groups of pre-training tasks, and each of them aims to capture a different level of semantics:
Word-Aware Tasks: they capture the lexical information of the text: the Knowledge Masking Task (i.e. it masks phrases and entities), the Capitalization Prediction Task (i.e. it predicts whether a word has a capitalized first letter), and the Token-Document Relation Prediction Task (i.e. it predicts whether a token of a sentence belongs to the document where the sentence initially appears).
Structure-Aware Tasks: they learn the relationships between sentences: the sentence reordering task (i.e. a sentence is split and shuffled, and the model must find the correct order) and the sentence distance task (i.e. the model must decide whether two sentences are adjacent, belong to the same document, or are entirely unrelated).
Semantic-Aware Tasks: they learn a higher order of knowledge: the discourse relation task (i.e. it predicts the semantic or rhetorical relation between sentences) and the IR relevance task (i.e. it determines the information-retrieval relevance of texts).

D. Specific language models

In order to tackle language-specific problems, different monolingual versions of BERT were trained for different languages. For example, BERTje [36] is a Dutch version, AlBERTo [37] is an Italian version, and CamemBERT [38] and FlauBERT [39] are two different models for French. These models outperform vanilla BERT on NLP tasks specific to these languages.

E. Cross-language model

XLM [40] aims to build a universal cross-language sentence embedding. The goal is to align sentence representations to improve the translation between languages. To do so, a Transformer architecture with two unsupervised tasks and one supervised task is used. The effectiveness of cross-language pre-training for improving multilingual machine translation is shown.

VII. GOING FURTHER

Despite the excellent performance of the Transformer architecture, new layers aiming to improve performance and complexity have been released.
The Transformer uses a gradient-based optimization procedure; it therefore needs to save the activation values of all the neurons so that they can be used during back-propagation. Because of the massive size of Transformer models, the GPU/TPU memory is rapidly saturated. The Reformer [41] counters this memory problem by recomputing the input of each layer during back-propagation instead of storing it. The Reformer can also reduce the number of operations during the forward pass by computing a hash function that groups similar inputs together; in this way, it does not have to compare all pairs of vectors to find the related ones. Therefore, it increases the size of the text it can treat at once.
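The hashing idea can be sketched with a generic random-projection LSH, where vectors pointing in similar directions tend to receive the same bucket id and are only compared within their bucket. This is an illustration of the principle, not the angular LSH scheme actually used in [41].

import numpy as np

def lsh_buckets(vectors, n_planes=4, seed=0):
    # random-projection LSH: the sign pattern w.r.t. random hyperplanes is the bucket id,
    # so vectors pointing in similar directions usually share a bucket
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(vectors.shape[1], n_planes))
    bits = (vectors @ planes) > 0
    return bits.astype(int) @ (1 << np.arange(n_planes))   # pack the sign bits into an id

vecs = np.array([[1.0, 0.1], [0.9, 0.2], [-1.0, 0.3], [-0.8, 0.1]])
print(lsh_buckets(vecs))   # the two similar pairs usually receive the same bucket id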
Another way to improve the architecture of a network is by using an evolutionary algorithm, as proposed by [42]. To create a new, automatically designed architecture, they evolve a population of Transformers based on their accuracy. Using Progressive Dynamic Hurdles (PDH), they could reduce the search space and the training time. With this technique and an extensive amount of computational power (around 200 TPUs), they could find a new architecture that outperforms the previous one.

VIII. CONCLUSION

Transformer-based networks have pushed machine reasoning skills towards human-level abilities; they even exceed human performance on a few GLUE tasks. They have changed the face of NLP: they go far beyond the results obtained with RNNs, and they do it faster. They have also helped solve many problems at the same time by providing a direct and efficient way to combine several downstream tasks. Nevertheless, much work remains before we have a system with a human-level comprehension of the underlying meaning of texts that is also small enough to run on devices with low computational power.

REFERENCES

[1] F. J. Och and H. Ney, "The Alignment Template Approach to Statistical Machine Translation," Computational Linguistics, vol. 30, pp. 417–449, Dec. 2004.
[2] A. M. Rush, S. Chopra, and J. Weston, "A Neural Attention Model for Abstractive Sentence Summarization," arXiv:1509.00685 [cs], Sept. 2015.
[3] L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient," arXiv:1609.05473 [cs], Aug. 2017.
[4] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, pp. 1735–1780, Nov. 1997.
[5] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," arXiv:1406.1078 [cs, stat], Sept. 2014.
[6] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A Search Space Odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 2222–2232, Oct. 2017.
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," arXiv:1706.03762 [cs], Dec. 2017.
[8] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving Language Understanding by Generative Pre-Training," 2018.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805 [cs], May 2019.
[10] Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang, "ERNIE 2.0: A Continual Pre-training Framework for Language Understanding," arXiv:1907.12412 [cs], Nov. 2019.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs], Sept. 2013.
[12] J. Pennington, R. Socher, and C. Manning, "GloVe: Global Vectors for Word Representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543, Association for Computational Linguistics, 2014.
[13] R. Sennrich, B. Haddow, and A. Birch, "Neural Machine Translation of Rare Words with Subword Units," arXiv:1508.07909 [cs], June 2016.
[14] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language Models are Unsupervised Multitask Learners," 2019.
[15] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing," arXiv:1808.06226 [cs], Aug. 2018.
[16] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A Robustly Optimized BERT Pretraining Approach," arXiv:1907.11692 [cs], July 2019.
[17] T. H. Trinh and Q. V. Le, "A Simple Method for Commonsense Reasoning," arXiv:1806.02847 [cs], Sept. 2019.
[18] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books," arXiv:1506.06724 [cs], June 2015.
[19] R. Parker, D. Graff, and J. Kong, "English Gigaword," Linguistic Data Consortium, Jan. 2011.
[20] J. Callan, M. Hoy, C. Yoo, and L. Zhao, "The ClueWeb09 Dataset - Dataset Information and Sample Files," Jan. 2009.
[21] A. Gokaslan and V. Cohen, "OpenWebText Corpus," Jan. 2019.
[22] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi, "Defending Against Neural Fake News," arXiv:1905.12616 [cs], Oct. 2019.
[23] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding," arXiv:1804.07461 [cs], Feb. 2019.
[24] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems," arXiv:1905.00537 [cs], July 2019.
[25] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ Questions for Machine Comprehension of Text," arXiv:1606.05250 [cs], Oct. 2016.
[26] P. Rajpurkar, R. Jia, and P. Liang, "Know What You Don't Know: Unanswerable Questions for SQuAD," arXiv:1806.03822 [cs], June 2018.
[27] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, "RACE: Large-scale ReAding Comprehension Dataset From Examinations," arXiv:1704.04683 [cs], Dec. 2017.
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized Autoregressive Pretraining for Language Understanding," arXiv:1906.08237 [cs], June 2019.
[29] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context," arXiv:1901.02860 [cs, stat], June 2019.
[30] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, "VideoBERT: A Joint Model for Video and Language Representation Learning," arXiv:1904.01766 [cs], Sept. 2019.
[31] A. Wang and K. Cho, "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model," arXiv:1902.04094 [cs], Apr. 2019.
[32] A. Baevski, S. Edunov, Y. Liu, L. Zettlemoyer, and M. Auli, "Cloze-driven Pretraining of Self-attention Networks," arXiv:1903.07785 [cs], Mar. 2019.
[33] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter," arXiv:1910.01108 [cs], Oct. 2019.
[34] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations," arXiv:1909.11942 [cs], Oct. 2019.
[35] X. Liu, P. He, W. Chen, and J. Gao, "Multi-Task Deep Neural Networks for Natural Language Understanding," arXiv:1901.11504 [cs], May 2019.
[36] W. de Vries, A. van Cranenburgh, A. Bisazza, T. Caselli, G. van Noord, and M. Nissim, "BERTje: A Dutch BERT Model," arXiv:1912.09582 [cs], Dec. 2019.
[37] M. Polignano, P. Basile, and M. de Gemmis, "ALBERTO: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets," 2019.
[38] L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, E. V. de la Clergerie, D. Seddah, and B. Sagot, "CamemBERT: a Tasty French Language Model," arXiv:1911.03894 [cs], May 2020.
[39] H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux, B. Lecouteux, A. Allauzen, B. Crabbé, L. Besacier, and D. Schwab, "FlauBERT: Unsupervised Language Model Pre-training for French," arXiv:1912.05372 [cs], Mar. 2020.
[40] G. Lample and A. Conneau, "Cross-lingual Language Model Pretraining," arXiv:1901.07291 [cs], Jan. 2019.
[41] N. Kitaev, L. Kaiser, and A. Levskaya, "Reformer: The Efficient Transformer," arXiv:2001.04451 [cs, stat], Jan. 2020.
[42] D. R. So, C. Liang, and Q. V. Le, "The Evolved Transformer," arXiv:1901.11117 [cs, stat], May 2019.

