Language Models As Knowledge Bases?
Transformer-XL: Dai et al. (2019) introduce a large-scale language model based on the Transformer (Vaswani et al., 2017). Transformer-XL can take into account a longer history by caching previous outputs and by using relative instead of absolute positional encoding. It achieves a test perplexity of 18.3 on the WikiText-103 corpus.
2.2 Bidirectional “Language Models”2

So far, we have looked at language models that predict the next word given a history of words. However, in many downstream applications we mostly care about having access to contextual representations of words, i.e., word representations that are a function of the entire context of a unit of text such as a sentence or paragraph, and not only conditioned on previous words. Formally, given an input sequence w = [w1, w2, ..., wN] and a position 1 ≤ i ≤ N, we want to estimate p(wi) = p(wi | w1, ..., wi−1, wi+1, ..., wN) using the left and right context of that word.

ELMo: To estimate this probability, Peters et al. (2018a) propose running a forward and a backward LSTM (Hochreiter and Schmidhuber, 1997), resulting in →hi and ←hi, which are consequently used to calculate a forward and a backward language model log-likelihood. Their model, ELMo, uses multiple layers of LSTMs and has been pretrained on the Google Billion Word dataset. Another version of the model, ELMo 5.5B, has been trained on the English Wikipedia and monolingual news crawl data from WMT 2008-2012.

BERT: Instead of a standard language model objective, Devlin et al. (2018a) propose to sample positions in the input sequence randomly and to learn to fill the word at the masked position. To this end, they employ a Transformer architecture and train it on the BookCorpus (Zhu et al., 2015) as well as a crawl of English Wikipedia. In addition to this pseudo language model objective, they use an auxiliary binary classification objective to predict whether a particular sentence follows the given sequence of words.

2 Contextual representation models (Tenney et al., 2019) might be a better name, but we keep calling them language models for simplicity.
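This masked-position objective is also what our probe exploits later. As a concrete illustration (a minimal sketch, not the exact code used in our experiments), the snippet below queries a pretrained BERT for the word at a masked position; it assumes the pytorch-transformers 1.x API and the publicly released bert-base-cased checkpoint.

```python
# Minimal sketch of a masked-token query; assumes the pytorch-transformers 1.x API
# and the bert-base-cased checkpoint (illustrative, not the experimental code).
import torch
from pytorch_transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

tokens = tokenizer.tokenize("[CLS] The theory of relativity was developed by [MASK] . [SEP]")
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
mask_position = tokens.index("[MASK]")

with torch.no_grad():
    scores = model(input_ids)[0]          # shape: (1, sequence_length, vocab_size)

values, indices = torch.topk(scores[0, mask_position], k=5)
print(tokenizer.convert_ids_to_tokens(indices.tolist()))
```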
3 Related Work

Many studies have investigated pretrained word representations, sentence representations, and language models. Existing work focuses on understanding linguistic and semantic properties of word representations or how well pretrained sentence representations and language models transfer linguistic knowledge to downstream tasks. In contrast, our investigation seeks to answer to what extent pretrained language models store factual and commonsense knowledge by comparing them with symbolic knowledge bases populated by traditional relation extraction approaches.
Baroni et al. (2014) present a systematic comparative analysis between neural word representation methods and more traditional count-based distributional semantic methods on lexical semantics tasks like semantic relatedness and concept categorization. They find that neural word representations outperform count-based distributional methods on the majority of the considered tasks. Hill et al. (2015) investigate to what degree word representations capture semantic meaning as measured by similarity between word pairs.

Marvin and Linzen (2018) assess the grammaticality of pretrained language models. Their dataset consists of sentence pairs with a grammatical and an ungrammatical sentence. While a good language model should assign higher probability to the grammatical sentence, they find that LSTMs do not learn syntax well.

Another line of work investigates the ability of pretrained sentence and language models to transfer knowledge to downstream natural language understanding tasks (Wang et al., 2018). While such an analysis sheds light on the transfer-learning abilities of pretrained models for understanding short pieces of text, it provides little insight into whether these models can compete with traditional approaches to representing knowledge like symbolic knowledge bases.

More recently, McCoy et al. (2019) found that for natural language inference, a model based on BERT learns to rely heavily on fallible syntactic heuristics instead of a deeper understanding of the natural language input. Peters et al. (2018b) found that lower layers in ELMo specialize in local syntactic relationships, while higher layers can learn to model long-range relationships. Similarly, Goldberg (2019) found that BERT captures English syntactic phenomena remarkably well. Tenney et al. (2019) investigate to what extent language models encode sentence structure for different syntactic and semantic phenomena and find that they excel at the former but only provide small improvements for tasks that fall into the latter category. While this provides insights into the linguistic knowledge of language models, it does not provide insights into their factual and commonsense knowledge.
Radford et al. (2018) introduce a pretrained language model based on the Transformer which they term generative pretraining (GPTv1). The first version of GPT (Radford et al., 2018) has been trained on the Book Corpus (Zhu et al., 2015) containing 7,000 books. The closest to our investigation is the work by Radford et al. (2019), which introduces GPTv2 and investigates how well their language model does zero-shot transfer to a range of downstream tasks. They find that GPTv2 achieves an F1 of 55 for answering questions in CoQA (Reddy et al., 2018) and 4.1% accuracy on the Natural Questions dataset (Kwiatkowski et al., 2019), in both cases without making use of annotated question-answer pairs or an information retrieval step. While these results are encouraging and hint at the ability of very large pretrained language models to memorize factual knowledge, the large GPTv2 model has not been made public and the publicly available small version achieves less than 1% on Natural Questions (5.3 times worse than the large model). Thus, we decided not to include GPTv2 in our study. Similarly, we do not include GPTv1 in this study as it uses a limited lower-cased vocabulary, making it incompatible with the way we assess the other language models.
4 The LAMA Probe

We introduce the LAMA (LAnguage Model Analysis) probe to test the factual and commonsense knowledge in language models. It provides a set of knowledge sources, each composed of a corpus of facts. Facts are either subject-relation-object triples or question-answer pairs. Each fact is converted into a cloze statement which is used to query the language model for a missing token. We evaluate each model based on how highly it ranks the ground truth token against every other word in a fixed candidate vocabulary. This is similar to ranking-based metrics from the knowledge base completion literature (Bordes et al., 2013; Nickel et al., 2016). Our assumption is that models which rank ground truth tokens high for these cloze statements have more factual knowledge. We discuss each step in detail next and provide considerations on the probe below.
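To make the flow concrete, the sketch below instantiates a relation template as a cloze query and computes the rank of the ground-truth token over a candidate vocabulary. The scorer, the candidate words and the scores are toy stand-ins for an actual language model, not values from our experiments.

```python
# Sketch of the probing flow with a toy stand-in scorer (names and scores are
# illustrative only, not taken from our experiments).
from typing import Callable, List

def fact_to_cloze(template: str, subject: str, mask: str = "[MASK]") -> str:
    """Instantiate a relation template, e.g. "[S] was born in [O]" -> cloze query."""
    return template.replace("[S]", subject).replace("[O]", mask)

def rank_of_gold(query: str, gold: str, vocab: List[str],
                 score: Callable[[str, str], float]) -> int:
    """Score every candidate for the blank and return the 1-based rank of the gold token."""
    ordering = sorted(vocab, key=lambda w: score(query, w), reverse=True)
    return ordering.index(gold) + 1

toy_scores = {"Florence": 0.80, "Paris": 0.15, "1265": 0.04, "cat": 0.01}
vocab = list(toy_scores)
query = fact_to_cloze("[S] was born in [O]", "Dante")
rank = rank_of_gold(query, "Florence", vocab, lambda q, w: toy_scores[w])
print(query, rank)  # "Dante was born in [MASK]" 1 -> counts as a hit for P@1
```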
4.1 Knowledge Sources

To assess the different language models in Section 2, we cover a variety of sources of factual and commonsense knowledge. For each source, we describe the origin of fact triples (or question-answer pairs), how we transform them into cloze templates, and to what extent aligned texts exist in Wikipedia that are known to express a particular fact. We use the latter information in supervised baselines that extract knowledge representations directly from the aligned text.

4.1.1 Google-RE

The Google-RE corpus3 contains ∼60K facts manually extracted from Wikipedia. It covers five relations, but we consider only three of them, namely “place of birth”, “date of birth” and “place of death”. We exclude the other two because they contain mainly multi-token objects that are not supported in our evaluation. We manually define a template for each considered relation, e.g., “[S] was born in [O]” for “place of birth”. Each fact in the Google-RE dataset is, by design, manually aligned to a short piece of Wikipedia text supporting it.

3 https://code.google.com/archive/p/relation-extraction-corpus/
4.1.2 T-REx

The T-REx knowledge source is a subset of Wikidata triples. It is derived from the T-REx dataset (Elsahar et al., 2018) and is much larger than Google-RE, with a broader set of relations. We consider 41 Wikidata relations and subsample at most 1000 facts per relation. As with the Google-RE corpus, we manually define a template for each relation (see Table 3 for some examples). In contrast to the Google-RE knowledge source, T-REx facts were automatically aligned to Wikipedia and hence this alignment can be noisy. However, Elsahar et al. (2018) report an accuracy of 97.8% for the alignment technique over a test set.
4.1.3 ConceptNet

ConceptNet (Speer and Havasi, 2012) is a multilingual knowledge base, initially built on top of Open Mind Common Sense (OMCS) sentences. OMCS represents commonsense relationships between words and/or phrases. We consider facts from the English part of ConceptNet that have single-token objects covering 16 relations. For these ConceptNet triples, we find the OMCS sentence that contains both the subject and the object. We then mask the object within the sentence and use the sentence as a template for querying language models. If there are several sentences for a triple, we pick one at random. Note that for this knowledge source there is no explicit alignment of facts to Wikipedia sentences.
4.1.4 SQuAD

SQuAD (Rajpurkar et al., 2016) is a popular question answering dataset. We select a subset of 305 context-insensitive questions from the SQuAD development set with single-token answers. We manually create cloze-style questions from these questions, e.g., rewriting “Who developed the theory of relativity?” as “The theory of relativity was developed by ___”. For each question and answer pair, we know that the corresponding fact is expressed in Wikipedia since this is how SQuAD was created.
4.2 Models

We consider the following pretrained case-sensitive language models in our study (see Table 1): fairseq-fconv (Fs), Transformer-XL large (Txl), ELMo original (Eb), ELMo 5.5B (E5B), BERT-base (Bb) and BERT-large (Bl). We use the natural way of generating tokens for each model by following the definition of the training objective function.

Assume we want to compute the generation for the token at position t. For unidirectional language models, we use the network output (ht−1) just before the token to produce the output layer softmax. For ELMo, we consider the output just before the token (→ht−1) for the forward direction and just after it (←ht+1) for the backward direction. Following the loss definition in Peters et al. (2018a), we average forward and backward probabilities from the corresponding softmax layers. For BERT, we mask the token at position t, and we feed the output vector corresponding to the masked token (ht) into the softmax layer. To allow a fair comparison, we let models generate over a unified vocabulary, which is the intersection of the vocabularies for all considered models (∼21K case-sensitive tokens).
4.3 Baselines

To compare language models to canonical ways of using off-the-shelf systems for extracting symbolic knowledge and answering questions, we consider the following baselines.

Freq: For a subject and relation pair, this baseline ranks words based on how frequently they appear as objects for the given relation in the test data. It indicates the upper bound performance of a model that always predicts the same objects for a particular relation.
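A minimal sketch of this baseline is given below, using made-up facts; it simply counts object occurrences per relation and returns the same ranking for every subject.

```python
# Sketch of the Freq baseline with made-up facts: rank candidate objects by how
# often they occur as objects of the given relation in the test data.
from collections import Counter, defaultdict

test_facts = [  # (subject, relation, object) -- illustrative examples only
    ("Dante", "place_of_birth", "Florence"),
    ("Galileo", "place_of_birth", "Pisa"),
    ("Machiavelli", "place_of_birth", "Florence"),
]

object_counts = defaultdict(Counter)
for subj, rel, obj in test_facts:
    object_counts[rel][obj] += 1

def freq_ranking(relation: str):
    # The same ranking is returned regardless of the subject.
    return [obj for obj, _ in object_counts[relation].most_common()]

print(freq_ranking("place_of_birth"))  # ['Florence', 'Pisa']
```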
RE: For the relation-based knowledge sources, we consider the pretrained Relation Extraction (RE) model of Sorokin and Gurevych (2017). This model was trained on a subcorpus of Wikipedia annotated with Wikidata relations. It extracts relation triples from a given sentence using an LSTM-based encoder and an attention mechanism. Based on the alignment information from the knowledge sources, we provide the relation extractor with the sentences known to express the test facts. Using these datasets, RE constructs a knowledge graph of triples. At test time, we query this graph by finding the subject entity and then rank all objects in the correct relation based on the confidence scores returned by RE. We consider two versions of this procedure that differ in how the entity linking is implemented: REn makes use of a naïve entity linking solution based on exact string matching, while REo uses an oracle for entity linking in addition to string matching. In other words, assume we query for the object o of a test subject-relation fact (s, r, o) expressed in a sentence x. If RE has extracted any triple (s′, r, o′) from that sentence x, s′ will be linked to s and o′ to o. In practice, this means RE can return the correct solution o if any relation instance of the right type was extracted from x, regardless of whether it has a wrong subject or object.
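One possible (simplified) rendering of this test-time lookup is sketched below; the storage layout, the triples and the confidence values are illustrative assumptions and do not reflect the actual RE system's output format.

```python
# Simplified sketch of storing extracted triples and querying them at test time;
# the layout and the confidence values are illustrative assumptions.
from collections import defaultdict
from typing import Dict, List, Tuple

# (subject, relation) -> list of (object, confidence) produced by a relation extractor.
kg: Dict[Tuple[str, str], List[Tuple[str, float]]] = defaultdict(list)

def add_triple(subject: str, relation: str, obj: str, confidence: float) -> None:
    kg[(subject, relation)].append((obj, confidence))

def query(subject: str, relation: str) -> List[str]:
    """Rank candidate objects for a subject-relation pair by extraction confidence."""
    return [obj for obj, _ in sorted(kg[(subject, relation)], key=lambda x: -x[1])]

add_triple("iPod Touch", "P176", "Apple", 0.9)     # placeholder confidences
add_triple("iPod Touch", "P176", "Samsung", 0.2)
print(query("iPod Touch", "P176"))  # ['Apple', 'Samsung']
```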
subject slots. We do not query relation slots for
4.4 Metrics
two reasons. First, surface form realisations of
We consider rank-based metrics and compute re- relations will span several tokens, and as we dis-
sults per relation along with mean values across all cussed above, this poses a technical challenge that
relations. To account for multiple valid objects for is not in the scope of this work. Second, even if
a subject-relation pair (i.e., for N-M relations), we we could easily predict multi-token phrases, rela-
follow Bordes et al. (2013) and remove from the tions can generally be expressed with many dif-
candidates when ranking at test time all other valid ferent wordings, making it unclear what the gold
objects in the training data other than the one we standard pattern for a relation should be, and how
test. We use the mean precision at k (P@k). For to measure accuracy in this context.
a given fact, this value is 1 if the object is ranked
among the top k results, and 0 otherwise. Intersection of Vocabularies The models that
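The following sketch shows this filtered P@k computation on a toy ranking; the words and the set of other valid objects are illustrative assumptions.

```python
# Sketch of filtered P@k: other valid objects for the same subject-relation pair
# are removed from the ranking before checking the top-k positions (toy example).
from typing import List, Set

def precision_at_k(ranked: List[str], gold: str, other_valid: Set[str], k: int = 1) -> float:
    filtered = [w for w in ranked if w == gold or w not in other_valid]
    return 1.0 if gold in filtered[:k] else 0.0

# "France" is another valid object for this subject-relation pair, so it is
# filtered out and "Canada" counts as correct at k=1.
print(precision_at_k(["France", "Canada", "Germany"], gold="Canada",
                     other_valid={"France"}, k=1))  # 1.0
```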
4.5 Considerations

There are several important design decisions we made when creating the LAMA probe. Below we give more detailed justifications for these decisions.

Manually Defined Templates For each relation we manually define a template that queries for the object slot in that relation. One can expect that the choice of templates has an impact on the results, and this is indeed the case: for some relations we find both worse and better ways to query for the same information (with respect to a given model) by using an alternate template. We argue that this means we are measuring a lower bound for what language models know. We make this argument by analogy with traditional knowledge bases: they only have a single way of querying knowledge for a specific relation, namely by using the relation ID of that relation, and this way is used to measure their accuracy. For example, if the relation ID is works-For and the user asks for is-working-for, the accuracy of the KG would be 0.

Single Token We only consider single-token objects as our prediction targets. The reason we include this limitation is that multi-token decoding adds a number of additional tuneable parameters (beam size, candidate scoring weights, length normalization, n-gram repetition penalties, etc.) that obscure the knowledge we are trying to measure. Moreover, well-calibrated multi-token generation is still an active research area, particularly for bidirectional models (see e.g. Welleck et al. (2019)).

Object Slots We choose to only query object slots in triples, as opposed to subject or relation slots. By including reverse relations (e.g. contains and contained-by) we can also query subject slots. We do not query relation slots for two reasons. First, surface form realisations of relations will span several tokens, and as we discussed above, this poses a technical challenge that is not in the scope of this work. Second, even if we could easily predict multi-token phrases, relations can generally be expressed with many different wordings, making it unclear what the gold standard pattern for a relation should be, and how to measure accuracy in this context.
Intersection of Vocabularies The models that we consider are trained with different vocabularies. For instance, ELMo uses a list of ∼800K tokens while BERT considers only ∼30K tokens. The size of the vocabulary can influence the performance of a model on the LAMA probe: the larger the vocabulary, the harder it is to rank the gold token at the top. For this reason we use a common vocabulary of ∼21K case-sensitive tokens, obtained from the intersection of the vocabularies of all considered models. To allow a fair comparison, we let every model rank only tokens in this joint vocabulary.
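Conceptually, building this unified candidate set is a plain set intersection over the per-model vocabularies, as in the toy sketch below (toy word lists stand in for the real ∼800K and ∼30K vocabulary files).

```python
# Sketch of building the unified candidate vocabulary as the case-sensitive
# intersection of per-model vocabularies (toy word lists, not the real files).
elmo_vocab = {"Paris", "Florence", "Rome", "cat", "Antarctica"}
bert_vocab = {"Paris", "Florence", "Rome", "dog", "[MASK]"}
fconv_vocab = {"Paris", "Florence", "Berlin", "cat"}

common_vocab = set.intersection(elmo_vocab, bert_vocab, fconv_vocab)
print(sorted(common_vocab))  # ['Florence', 'Paris'] -- only these are ranked at test time
```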
5 Results

We summarize the main results in Table 2, which shows the mean precision at one (P@1) for the different models across the set of corpora considered. In the remainder of this section, we discuss the results for each corpus in detail.

Column groups: Statistics (#Facts, #Rel), Baselines (Freq, DrQA), KB (REn, REo), LM (Fs, Txl, Eb, E5B, Bb, Bl).

Corpus | Relation | #Facts | #Rel | Freq | DrQA | REn | REo | Fs | Txl | Eb | E5B | Bb | Bl
Google-RE | birth-place | 2937 | 1 | 4.6 | - | 3.5 | 13.8 | 4.4 | 2.7 | 5.5 | 7.5 | 14.9 | 16.1
Google-RE | birth-date | 1825 | 1 | 1.9 | - | 0.0 | 1.9 | 0.3 | 1.1 | 0.1 | 0.1 | 1.5 | 1.4
Google-RE | death-place | 765 | 1 | 6.8 | - | 0.1 | 7.2 | 3.0 | 0.9 | 0.3 | 1.3 | 13.1 | 14.0
Google-RE | Total | 5527 | 3 | 4.4 | - | 1.2 | 7.6 | 2.6 | 1.6 | 2.0 | 3.0 | 9.8 | 10.5
T-REx | 1-1 | 937 | 2 | 1.78 | - | 0.6 | 10.0 | 17.0 | 36.5 | 10.1 | 13.1 | 68.0 | 74.5
T-REx | N-1 | 20006 | 23 | 23.85 | - | 5.4 | 33.8 | 6.1 | 18.0 | 3.6 | 6.5 | 32.4 | 34.2
T-REx | N-M | 13096 | 16 | 21.95 | - | 7.7 | 36.7 | 12.0 | 16.5 | 5.7 | 7.4 | 24.7 | 24.3
T-REx | Total | 34039 | 41 | 22.03 | - | 6.1 | 33.8 | 8.9 | 18.3 | 4.7 | 7.1 | 31.1 | 32.3
ConceptNet | Total | 11458 | 16 | 4.8 | - | - | - | 3.6 | 5.7 | 6.1 | 6.2 | 15.6 | 19.2
SQuAD | Total | 305 | - | - | 37.5 | - | - | 3.6 | 3.9 | 1.6 | 4.3 | 14.1 | 17.4

Table 2: Mean precision at one (P@1) for a frequency baseline (Freq), DrQA, a relation extraction model with naïve entity linking (REn) and with oracle entity linking (REo), fairseq-fconv (Fs), Transformer-XL large (Txl), ELMo original (Eb), ELMo 5.5B (E5B), BERT-base (Bb) and BERT-large (Bl) across the set of evaluation corpora.

Google-RE We query the LMs using a standard cloze template for each relation. The base and large versions of BERT both outperform all other models by a substantial margin. Furthermore, they
Figure 3: Pearson correlation coefficient for the P@1 of the BERT-large model on T-REx and a set of metrics: SM and OM refer to the number of times a subject and an object are mentioned in the BERT training corpus4 respectively; LPFP is the log probability score associated with the first prediction; SOCS is the cosine similarity between subject and object vectors (we use spaCy5); ST and SWP are the number of tokens in the subject with a standard tokenization and the BERT WordPiece tokenization respectively.

Figure 4: Average rank distribution for 10 different mentions of 100 random facts per relation in T-REx. ELMo 5.5B and both variants of BERT are least sensitive to the framing of the query but also are the most likely to have seen the query sentence during training.

The other models experience a higher variability in their predictions. Note that BERT and ELMo 5.5B have been trained on a larger portion of Wikipedia than fairseq-fconv and Transformer-XL and may have seen more sentences containing the test queries during training.
Relation | Query | Answer | Generation (top five tokens with log probability)

T-REx:
P530 | Kenya maintains diplomatic relations with ___. | Uganda | India [-3.0], Uganda [-3.2], Tanzania [-3.5], China [-3.6], Pakistan [-3.6]
P176 | iPod Touch is produced by ___. | Apple | Apple [-1.6], Nokia [-1.7], Sony [-2.0], Samsung [-2.6], Intel [-3.1]
P30 | Bailey Peninsula is located in ___. | Antarctica | Antarctica [-1.4], Bermuda [-2.2], Newfoundland [-2.5], Alaska [-2.7], Canada [-3.1]
P178 | JDK is developed by ___. | Oracle | IBM [-2.0], Intel [-2.3], Microsoft [-2.5], HP [-3.4], Nokia [-3.5]
P1412 | Carl III used to communicate in ___. | Swedish | German [-1.6], Latin [-1.9], French [-2.4], English [-3.0], Spanish [-3.0]
P17 | Sunshine Coast, British Columbia is located in ___. | Canada | Canada [-1.2], Alberta [-2.8], Yukon [-2.9], Labrador [-3.4], Victoria [-3.4]
P39 | Pope Clement VII has the position of ___. | pope | cardinal [-2.4], Pope [-2.5], pope [-2.6], President [-3.1], Chancellor [-3.2]
P264 | Joe Cocker is represented by music label ___. | Capitol | EMI [-2.6], BMG [-2.6], Universal [-2.8], Capitol [-3.2], Columbia [-3.3]
P276 | London Jazz Festival is located in ___. | London | London [-0.3], Greenwich [-3.2], Chelsea [-4.0], Camden [-4.6], Stratford [-4.8]
P127 | Border TV is owned by ___. | ITV | Sky [-3.1], ITV [-3.3], Global [-3.4], Frontier [-4.1], Disney [-4.3]
P103 | The native language of Mammootty is ___. | Malayalam | Malayalam [-0.2], Tamil [-2.1], Telugu [-4.8], English [-5.2], Hindi [-5.6]
P495 | The Sharon Cuneta Show was created in ___. | Philippines | Manila [-3.2], Philippines [-3.6], February [-3.7], December [-3.8], Argentina [-4.0]

ConceptNet:
AtLocation | You are likely to find a overflow in a ___. | drain | sewer [-3.1], canal [-3.2], toilet [-3.3], stream [-3.6], drain [-3.6]
CapableOf | Ravens can ___. | fly | fly [-1.5], fight [-1.8], kill [-2.2], die [-3.2], hunt [-3.4]
CausesDesire | Joke would make you want to ___. | laugh | cry [-1.7], die [-1.7], laugh [-2.0], vomit [-2.6], scream [-2.6]
Causes | Sometimes virus causes ___. | infection | disease [-1.2], cancer [-2.0], infection [-2.6], plague [-3.3], fever [-3.4]
HasA | Birds have ___. | feathers | wings [-1.8], nests [-3.1], feathers [-3.2], died [-3.7], eggs [-3.9]
HasPrerequisite | Typing requires ___. | speed | patience [-3.5], precision [-3.6], registration [-3.8], accuracy [-4.0], speed [-4.1]
HasProperty | Time is ___. | finite | short [-1.7], passing [-1.8], precious [-2.9], irrelevant [-3.2], gone [-4.0]
MotivatedByGoal | You would celebrate because you are ___. | alive | happy [-2.4], human [-3.3], alive [-3.3], young [-3.6], free [-3.9]
ReceivesAction | Skills can be ___. | taught | acquired [-2.5], useful [-2.5], learned [-2.8], combined [-3.9], varied [-3.9]
UsedFor | A pond is for ___. | fish | swimming [-1.3], fishing [-1.4], bathing [-2.0], fish [-2.8], recreation [-3.1]

Table 3: Examples of generation for BERT-large. The last column reports the top five tokens generated together with the associated log probability (in square brackets).
that BERT-large is able to recall such knowledge better than its competitors and at a level remarkably competitive with non-neural and supervised alternatives. Note that we did not compare the ability of the corresponding architectures and objectives to capture knowledge in a given body of text but rather focused on the knowledge present in the weights of existing pretrained models that are being used as starting points for many researchers' work. Understanding which aspects of data our commonly-used models and learning algorithms are capturing is a crucial field of research and this paper complements the many studies focused on the learned linguistic properties of the data.

We found that it is non-trivial to extract a knowledge base from text that performs on par with directly using pretrained BERT-large. This is despite providing our relation extraction baseline with only data that is likely expressing target facts, thus reducing the potential for false negatives, as well as using a generous entity-linking oracle. We suspected BERT might have an advantage due to the larger amount of data it has processed, so we added Wikitext-103 as additional data to the relation extraction system and observed no significant change in performance. This suggests that while relation extraction performance might be difficult to improve with more data, language models trained on ever growing corpora might become a viable alternative to traditional knowledge bases extracted from text in the future.

In addition to testing future pretrained language models using the LAMA probe, we are interested in quantifying the variance of recalling factual knowledge with respect to varying natural language templates. Moreover, assessing multi-token answers remains an open challenge for our evaluation setup.

Acknowledgments

We would like to thank the reviewers for their thoughtful comments and efforts towards improving our manuscript. In addition, we would like to acknowledge three frameworks that were used in our experiments: AllenNLP7, Fairseq8 and the Hugging Face PyTorch-Transformers9 library.

7 https://github.com/allenai/allennlp
8 https://github.com/pytorch/fairseq
9 https://github.com/huggingface/pytorch-transformers

References
Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Pushmeet Kohli, and Edward Grefenstette. 2019. Learning to understand goal specifications by modelling reward. In International Conference on Learning Representations (ICLR).
Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 238–247.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2787–2795.
S. R. K. Branavan, David Silver, and Regina Barzilay. 2011. Learning to win by reading manuals in a monte-carlo framework. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages 268–277.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. CoRR, abs/1704.00051.
Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. 2018. Babyai: First steps towards grounded language learning with a human in the loop. CoRR, abs/1810.08272.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860.
Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 933–941.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018a. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018b. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]. ArXiv: 1810.04805.
Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-rex: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).
Yoav Goldberg. 2019. Assessing bert's syntactic abilities. CoRR, abs/1901.05287.
Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Rhinehart, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, et al. 2019. Natural questions: a benchmark for question answering research.
Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. 2019. A Survey of Reinforcement Learning Informed by Natural Language. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, August 10-16 2019, Macao, China.
Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1192–1202.
R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference.
Gábor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. CoRR, abs/1707.05589.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. CoRR, abs/1609.07843.
Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), Miami, FL, USA, December 2-5, 2012, pages 234–239.
Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.
Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1499–1509.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. Coqa: A conversational question answering challenge. CoRR, abs/1808.07042.
Daniil Sorokin and Iryna Gurevych. 2017. Context-aware representations for knowledge base relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 1784–1789.
Robert Speer and Catherine Havasi. 2012. Representing general relational knowledge in conceptnet 5. In LREC, pages 3679–3686.
Mihai Surdeanu and Heng Ji. 2014. Overview of the English Slot Filling Track at the TAC2014 Knowledge Base Population Evaluation. page 15.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4149–4158.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, pages 353–355.
Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. 2019. Non-monotonic sequential text generation. arXiv preprint arXiv:1902.02192.
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2018. From recognition to cognition: Visual commonsense reasoning. CoRR, abs/1811.10830.
Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 19–27.