BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com
[Figure 1 diagram: two panels, "Pre-training" (left) and "Fine-Tuning" (right). The model takes the input tokens [CLS], Tok 1 ... Tok N, [SEP], Tok 1 ... Tok M and produces the final hidden vectors C, T1 ... TN, T[SEP], T1' ... TM'.]

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).
ing and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015).

2.3 Transfer Learning from Supervised Data

There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained with ImageNet (Deng et al., 2009; Yosinski et al., 2014).

3 BERT

We introduce BERT and its detailed implementation in this section. There are two steps in our framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. The question-answering example in Figure 1 will serve as a running example for this section.

A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

Model Architecture  BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.1 Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as "The Annotated Transformer."2

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A.3 We primarily report results on two model sizes: BERTBASE (L=12, H=768, A=12, Total Parameters=110M) and BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M).

BERTBASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left.4
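As an illustration of how these hyperparameters relate, the two model sizes can be written down as a small configuration object. This is only a sketch, not the released configuration format; the class and field names below are our own.

from dataclasses import dataclass

@dataclass
class BertSizeConfig:
    """Illustrative hyperparameters; names are ours, not the official config keys."""
    num_layers: int     # L: number of Transformer blocks
    hidden_size: int    # H: hidden size
    num_heads: int      # A: number of self-attention heads

    @property
    def feed_forward_size(self) -> int:
        # The feed-forward/filter size is set to 4H (footnote 3).
        return 4 * self.hidden_size

BERT_BASE = BertSizeConfig(num_layers=12, hidden_size=768, num_heads=12)    # ~110M parameters
BERT_LARGE = BertSizeConfig(num_layers=24, hidden_size=1024, num_heads=16)  # ~340M parameters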
1 https://github.com/tensorflow/tensor2tensor
2 http://nlp.seas.harvard.edu/2018/04/03/attention.html
3 In all cases we set the feed-forward/filter size to be 4H, i.e., 3072 for the H = 768 and 4096 for the H = 1024.
4 We note that in the literature the bidirectional Transformer is often referred to as a "Transformer encoder" while the left-context-only version is referred to as a "Transformer decoder" since it can be used for text generation.
Input/Output Representations  To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a "sentence" can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A "sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote the input embedding as E, the final hidden vector of the special [CLS] token as C ∈ R^H, and the final hidden vector for the i-th input token as T_i ∈ R^H.

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.
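Concretely, this construction is an element-wise sum of three embedding lookups. The NumPy sketch below illustrates it under assumed sizes; the token ids and the 512-position table are illustrative, and the random arrays stand in for learned parameters.

import numpy as np

H, VOCAB_SIZE, MAX_POSITIONS = 768, 30000, 512   # hidden size, WordPiece vocab, positions (assumed)

rng = np.random.default_rng(0)
token_emb = rng.normal(size=(VOCAB_SIZE, H))      # one row per WordPiece id
segment_emb = rng.normal(size=(2, H))             # row 0 = sentence A, row 1 = sentence B
position_emb = rng.normal(size=(MAX_POSITIONS, H))

def input_representation(token_ids, segment_ids):
    """Sum of token, segment, and position embeddings for one packed sequence."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# The Figure 2 example "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]",
# with made-up WordPiece ids (1 = [CLS], 2 = [SEP]):
token_ids = np.array([1, 40, 41, 42, 43, 2, 50, 51, 52, 53, 2])
segment_ids = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
E = input_representation(token_ids, segment_ids)  # shape (11, 768)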
3.1 Pre-training BERT

Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two unsupervised tasks, described in this section. This step is presented in the left part of Figure 1.

Task #1: Masked LM  Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context.

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a "masked LM" (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace "masked" words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged i-th token 10% of the time. Then, T_i will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.
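The replacement rule above can be sketched as a small data generator. This is a simplified illustration: the special-token ids, the [MASK] id, and the vocabulary size are assumptions, and refinements used in practice (e.g., whole-word masking variants) are omitted.

import random

MASK_ID = 103                 # id of the [MASK] token (assumed)
VOCAB_SIZE = 30000
SPECIAL_IDS = {1, 2}          # e.g. [CLS] and [SEP]; never selected for prediction (assumed ids)

def create_mlm_example(token_ids, mask_prob=0.15):
    """Return corrupted inputs and prediction targets (-1 means "not predicted")."""
    inputs = list(token_ids)
    targets = [-1] * len(token_ids)
    candidates = [i for i, t in enumerate(token_ids) if t not in SPECIAL_IDS]
    num_to_predict = max(1, int(round(len(candidates) * mask_prob)))
    for i in random.sample(candidates, num_to_predict):
        targets[i] = token_ids[i]           # T_i predicts the original token (cross-entropy)
        r = random.random()
        if r < 0.8:                         # 80%: replace with [MASK]
            inputs[i] = MASK_ID
        elif r < 0.9:                       # 10%: replace with a random token
            inputs[i] = random.randrange(VOCAB_SIZE)
        # remaining 10%: leave the token unchanged
    return inputs, targets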
Task #2: Next Sentence Prediction (NSP)  Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, C is used for next sentence prediction (NSP).5 Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.6
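Generating the 50/50 NSP training pairs from a document-level corpus can be sketched as follows; documents are assumed to be lists of sentences, and the helper name is ours.

import random

def make_nsp_example(documents):
    """Pick (sentence A, sentence B, label) from a corpus of sentence-segmented documents."""
    doc = random.choice([d for d in documents if len(d) > 1])
    idx = random.randrange(len(doc) - 1)            # position with a following sentence
    sent_a = doc[idx]
    if random.random() < 0.5:
        sent_b, label = doc[idx + 1], "IsNext"      # the actual next sentence
    else:
        other = random.choice(documents)            # a random sentence from the corpus
        sent_b, label = random.choice(other), "NotNext"
    return sent_a, sent_b, label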
5 The final model achieves 97%-98% accuracy on NSP.
6 The vector C is not a meaningful sentence representation without fine-tuning, since it was trained with NSP.
Input:                [CLS]   my    dog    is    cute    [SEP]   he    likes   play    ##ing   [SEP]
Token Embeddings:     E[CLS]  Emy   Edog   Eis   Ecute   E[SEP]  Ehe   Elikes  Eplay   E##ing  E[SEP]
Segment Embeddings:   EA      EA    EA     EA    EA      EA      EB    EB      EB      EB      EB
Position Embeddings:  E0      E1    E2     E3    E4      E5      E6    E7      E8      E9      E10

Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.
The NSP task is closely related to representation-learning objectives used in Jernite et al. (2017) and Logeswaran and Lee (2018). However, in prior work, only sentence embeddings are transferred to down-stream tasks, where BERT transfers all parameters to initialize end-task model parameters.

Pre-training data  The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

3.2 Fine-tuning BERT

Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks, whether they involve single text or text pairs, by swapping out the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention, such as Parikh et al. (2016); Seo et al. (2017). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.

For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-∅ pair in text classification or sequence tagging. At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.7 We describe the task-specific details in the corresponding subsections of Section 4. More details can be found in Appendix A.5.

4 Experiments

In this section, we present BERT fine-tuning results on 11 NLP tasks.

4.1 GLUE

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) is a collection of diverse natural language understanding tasks. Detailed descriptions of GLUE datasets are included in Appendix B.1.

To fine-tune on GLUE, we represent the input sequence (for single sentence or sentence pairs) as described in Section 3, and use the final hidden vector C ∈ R^H corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters introduced during fine-tuning are classification layer weights W ∈ R^{K×H}, where K is the number of labels. We compute a standard classification loss with C and W, i.e., log(softmax(C W^T)).
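In other words, the GLUE head is a single affine map applied to C followed by a softmax. A minimal NumPy sketch, where the label count K and the initialization scale are illustrative assumptions:

import numpy as np

H, K = 768, 3                              # hidden size and number of labels (e.g. 3 for MNLI)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(K, H))    # the only new fine-tuning parameters, W in R^{K x H}

def classification_log_probs(C):
    """log(softmax(C W^T)) for a single [CLS] vector C in R^H."""
    logits = W @ C                          # shape (K,)
    m = logits.max()
    return logits - (m + np.log(np.sum(np.exp(logits - m))))

def classification_loss(C, label):
    """Standard cross-entropy: negative log-probability of the gold label."""
    return -classification_log_probs(C)[label]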
7 For example, the BERT SQuAD model can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%.
8 See (10) in https://gluebenchmark.com/faq.
Table 1: GLUE Test results. The number below each task name is the number of training examples.

System              MNLI-(m/mm)  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE   Average
                    392k         363k  108k  67k    8.5k  5.7k   3.5k  2.5k  -
Pre-OpenAI SOTA     80.6/80.1    66.1  82.3  93.2   35.0  81.0   86.0  61.7  74.0
BiLSTM+ELMo+Attn    76.4/76.1    64.8  79.8  90.4   36.0  73.3   84.9  56.8  71.0
OpenAI GPT          82.1/81.4    70.3  87.4  91.3   45.4  80.0   82.3  56.0  75.1
BERTBASE            84.6/83.4    71.2  90.5  93.5   52.1  85.8   88.9  66.4  79.6
BERTLARGE           86.7/85.9    72.1  92.7  94.9   60.5  86.5   89.3  70.1  82.1
We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set. Additionally, for BERTLARGE we found that fine-tuning was sometimes unstable on small datasets, so we ran several random restarts and selected the best model on the Dev set. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization.9
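This Dev-set selection is a plain grid search over learning rates, with several restarts used for the unstable BERTLARGE runs on small datasets. A sketch, where fine_tune and dev_score are hypothetical callables standing in for the actual training and evaluation loops, and the restart count is illustrative:

import random

LEARNING_RATES = [5e-5, 4e-5, 3e-5, 2e-5]
NUM_RESTARTS = 5                            # "several" restarts; the exact count is an assumption

def select_best_model(fine_tune, dev_score, checkpoint, task_data):
    """Grid-search learning rates and restart seeds, keeping the best Dev-set model."""
    best_model, best_score = None, float("-inf")
    for lr in LEARNING_RATES:
        for _ in range(NUM_RESTARTS):
            seed = random.randrange(2**31)  # changes only data shuffling and classifier init
            model = fine_tune(checkpoint, task_data, lr=lr, epochs=3, batch_size=32, seed=seed)
            score = dev_score(model)
            if score > best_score:
                best_model, best_score = model, score
    return best_model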
Results are presented in Table 1. Both BERTBASE and BERTLARGE outperform all systems on all tasks by a substantial margin, obtaining 4.5% and 7.0% respective average accuracy improvement over the prior state of the art. Note that BERTBASE and OpenAI GPT are nearly identical in terms of model architecture apart from the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a 4.6% absolute accuracy improvement. On the official GLUE leaderboard10, BERTLARGE obtains a score of 80.5, compared to OpenAI GPT, which obtains 72.8 as of the date of writing.

We find that BERTLARGE significantly outperforms BERTBASE across all tasks, especially those with very little training data. The effect of model size is explored more thoroughly in Section 5.2.

4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowd-sourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

As shown in Figure 1, in the question answering task, we represent the input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding. We only introduce a start vector S ∈ R^H and an end vector E ∈ R^H during fine-tuning. The probability of word i being the start of the answer span is computed as a dot product between T_i and S followed by a softmax over all of the words in the paragraph: P_i = e^{S·T_i} / Σ_j e^{S·T_j}. The analogous formula is used for the end of the answer span. The score of a candidate span from position i to position j is defined as S·T_i + E·T_j, and the maximum scoring span where j ≥ i is used as a prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.
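At inference time, the span selection above reduces to two dot products per token and a maximization over valid (i, j) pairs. A NumPy sketch, where T is the matrix of final hidden vectors and the maximum-answer-length constraint used in practice is omitted:

import numpy as np

def start_probabilities(T, S):
    """P_i = exp(S·T_i) / sum_j exp(S·T_j), a softmax over the paragraph tokens."""
    scores = T @ S
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

def best_span(T, S, E):
    """Return (i, j) maximizing S·T_i + E·T_j subject to j >= i."""
    start_scores = T @ S                         # S·T_i for every token i
    end_scores = T @ E                           # E·T_j for every token j
    best, best_score = (0, 0), -np.inf
    for i in range(len(T)):
        j = i + int(np.argmax(end_scores[i:]))   # best end position at or after i
        score = start_scores[i] + end_scores[j]
        if score > best_score:
            best, best_score = (i, j), score
    return best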
Table 2 shows top leaderboard entries as well as results from top published systems (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018). The top results from the SQuAD leaderboard do not have up-to-date public system descriptions available,11 and are allowed to use any public data when training their systems. We therefore use modest data augmentation in our system by first fine-tuning on TriviaQA (Joshi et al., 2017) before fine-tuning on SQuAD.

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. Without TriviaQA fine-

9 The GLUE data set distribution does not include the Test labels, and we only made a single GLUE evaluation server submission for each of BERTBASE and BERTLARGE.
10 https://gluebenchmark.com/leaderboard
11 QANet is described in Yu et al. (2018), but the system has improved substantially after publication.
Table 2: SQuAD v1.1 results.

System                        Dev EM  Dev F1  Test EM  Test F1
Top Leaderboard Systems (Dec 10th, 2018)
  Human                       -       -       82.3     91.2
  #1 Ensemble - nlnet         -       -       86.0     91.7
  #2 Ensemble - QANet         -       -       84.5     90.5
Published
  BiDAF+ELMo (Single)         -       85.6    -        85.8
  R.M. Reader (Ensemble)      81.2    87.9    82.3     88.5
Ours
  BERTBASE (Single)           80.8    88.5    -        -
  BERTLARGE (Single)          84.1    90.9    -        -
  BERTLARGE (Ensemble)        85.8    91.8    -        -
  BERTLARGE (Sgl.+TriviaQA)   84.2    91.1    85.1     91.8
  BERTLARGE (Ens.+TriviaQA)   86.2    92.2    87.4     93.2

System                  Dev   Test
ESIM+GloVe              51.9  52.7
ESIM+ELMo               59.1  59.2
OpenAI GPT              -     78.0
BERTBASE                81.6  -
BERTLARGE               86.6  86.3
Human (expert)†         -     85.0
Human (5 annotations)†  -     88.0

Table 4: SWAG Dev and Test accuracies. † Human performance is measured with 100 samples, as reported in the SWAG paper.
References

Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.

Luisa Bentivogli, Bernardo Magnini, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC. NIST.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP. Association for Computational Linguistics.

Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

William Fedus, Ian Goodfellow, and Andrew M Dai. 2018. MaskGAN: Better text generation via filling in the ______. arXiv preprint arXiv:1801.07736.

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.