
Language Models as Knowledge Bases?

Fabio Petroni1   Tim Rocktäschel1,2   Patrick Lewis1,2   Anton Bakhtin1
Yuxiang Wu1,2   Alexander H. Miller1   Sebastian Riedel1,2
1 Facebook AI Research
2 University College London
{fabiopetroni, rockt, plewis, yolo, yuxiangwu, ahm, sriedel}@fb.com

arXiv:1909.01066v2 [cs.CL] 4 Sep 2019

Abstract

Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as "fill-in-the-blank" cloze statements. Language models have many advantages over structured knowledge bases: they require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that (i) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, (ii) BERT also does remarkably well on open-domain question answering against a supervised baseline, and (iii) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available at https://github.com/facebookresearch/LAMA.

[Figure 1: Querying knowledge bases (KB) and language models (LM) for factual knowledge. A symbolic KG answers the query (Dante, born-in, X) with "Florence" via symbolic memory access; a neural LM (e.g. ELMo/BERT) answers the cloze query "Dante was born in [Mask]." with "Florence" via LM memory access.]
1 Introduction
Recently, pretrained high-capacity language models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018a) have become increasingly important in NLP. They are optimised to either predict the next word in a sequence or some masked word anywhere in a given sequence (e.g. "Dante was born in [Mask] in the year 1265."). The parameters of these models appear to store vast amounts of linguistic knowledge (Peters et al., 2018b; Goldberg, 2019; Tenney et al., 2019) useful for downstream tasks. This knowledge is usually accessed either by conditioning on latent context representations produced by the original model or by using the original model weights to initialize a task-specific model which is then further fine-tuned. This type of knowledge transfer is crucial for current state-of-the-art results on a wide range of tasks.

In contrast, knowledge bases are effective solutions for accessing annotated gold-standard relational data by enabling queries such as (Dante, born-in, X). However, in practice we often need to extract relational data from text or other modalities to populate these knowledge bases. This requires complex NLP pipelines involving entity extraction, coreference resolution, entity linking and relation extraction (Surdeanu and Ji, 2014), components that often need supervised data and fixed schemas. Moreover, errors can easily propagate and accumulate throughout the pipeline. Instead, we could attempt to query neural language models for relational data by asking them to fill in masked tokens in sequences like "Dante was born in [Mask]", as illustrated in Figure 1.
In this setting, language models come with various attractive properties: they require no schema engineering, do not need human annotations, and they support an open set of queries.

Given the above qualities of language models as potential representations of relational knowledge, we are interested in the relational knowledge already present in pretrained off-the-shelf language models such as ELMo and BERT. How much relational knowledge do they store? How does this differ for different types of knowledge such as facts about entities, common sense, and general question answering? How does their performance without fine-tuning compare to symbolic knowledge bases automatically extracted from text?

Beyond gathering a better general understanding of these models, we believe that answers to these questions can help us design better unsupervised knowledge representations that could transfer factual and commonsense knowledge reliably to downstream tasks such as commonsense (visual) question answering (Zellers et al., 2018; Talmor et al., 2019) or reinforcement learning (Branavan et al., 2011; Chevalier-Boisvert et al., 2018; Bahdanau et al., 2019; Luketina et al., 2019).

For the purpose of answering the above questions we introduce the LAMA (LAnguage Model Analysis) probe, consisting of a set of knowledge sources, each comprised of a set of facts. We define that a pretrained language model knows a fact (subject, relation, object) such as (Dante, born-in, Florence) if it can successfully predict masked objects in cloze sentences such as "Dante was born in ___" expressing that fact. We test for a variety of types of knowledge: relations between entities stored in Wikidata, common sense relations between concepts from ConceptNet, and knowledge necessary to answer natural language questions in SQuAD. In the latter case we manually map a subset of SQuAD questions to cloze sentences.

Our investigation reveals that (i) the largest BERT model from Devlin et al. (2018b) (BERT-large) captures (accurate) relational knowledge comparable to that of a knowledge base extracted with an off-the-shelf relation extractor and an oracle-based entity linker from a corpus known to express the relevant knowledge, (ii) factual knowledge can be recovered surprisingly well from pretrained language models, however, for some relations (particularly N-to-M relations) performance is very poor, (iii) BERT-large consistently outperforms other language models in recovering factual and commonsense knowledge while at the same time being more robust to the phrasing of a query, and (iv) BERT-large achieves remarkable results for open-domain QA, reaching 57.1% precision@10 compared to 63.5% of a knowledge base constructed using a task-specific supervised relation extraction system.

2 Background

In this section we provide background on language models. Statistics for the models that we include in our investigation are summarized in Table 1.

2.1 Unidirectional Language Models

Given an input sequence of tokens w = [w_1, w_2, ..., w_N], unidirectional language models commonly assign a probability p(w) to the sequence by factorizing it as follows

    p(w) = ∏_t p(w_t | w_{t-1}, ..., w_1).                    (1)

A common way to estimate this probability is using neural language models (Mikolov and Zweig, 2012; Melis et al., 2017; Bengio et al., 2003) with

    p(w_t | w_{t-1}, ..., w_1) = softmax(W h_t + b)            (2)

where h_t ∈ R^k is the output vector of a neural network at position t and W ∈ R^{|V|×k} is a learned parameter matrix that maps h_t to unnormalized scores for every word in the vocabulary V. Various neural language models then mainly differ in how they compute h_t given the word history, e.g., by using a multi-layer perceptron (Bengio et al., 2003; Mikolov and Zweig, 2012), convolutional layers (Dauphin et al., 2017), recurrent neural networks (Zaremba et al., 2014; Merity et al., 2016; Melis et al., 2017) or self-attention mechanisms (Radford et al., 2018; Dai et al., 2019; Radford et al., 2019).
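To make Equations (1) and (2) concrete, the following is a minimal PyTorch sketch of how a unidirectional neural language model scores a sequence. It is not the paper's code; the tiny GRU encoder, the toy vocabulary size, and the example token ids are illustrative assumptions (the models in Table 1 are far larger), and only the mapping from per-position hidden states to a softmax over the vocabulary is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUnidirectionalLM(nn.Module):
    """Toy left-to-right LM: h_t from a GRU, scores = W h_t + b as in Eq. (2)."""
    def __init__(self, vocab_size=100, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # the W and b of Eq. (2)

    def sequence_log_prob(self, token_ids):
        # token_ids: LongTensor of shape (1, N)
        h, _ = self.rnn(self.embed(token_ids))        # hidden state for every position
        logits = self.out(h[:, :-1, :])               # predict w_t from the state after w_{t-1}
        log_probs = F.log_softmax(logits, dim=-1)     # softmax over the vocabulary V
        targets = token_ids[:, 1:]                    # w_2 ... w_N
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        return token_lp.sum()                         # log p(w) = sum_t log p(w_t | w_<t), cf. Eq. (1)

lm = TinyUnidirectionalLM()
print(lm.sequence_log_prob(torch.tensor([[5, 17, 42, 7]])))  # untrained, so the value is meaningless
```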
(Radford et al., 2018; Dai et al., 2019; Radford
Our investigation reveals that (i) the largest
et al., 2019).
BERT model from Devlin et al. (2018b)
fairseq-fconv: Instead of commonly used recur-
(BERT-large) captures (accurate) relational
rent neural networks, Dauphin et al. (2017) use
knowledge comparable to that of a knowledge
multiple layers of gated convolutions. We use
base extracted with an off-the-shelf relation
the pretrained model in the fairseq1 library in our
extractor and an oracle-based entity linker from
study. It has been trained on the WikiText-103 cor-
a corpus known to express the relevant knowl-
pus introduced by Merity et al. (2016).
edge, (ii) factual knowledge can be recovered
surprisingly well from pretrained language mod- 1
https://github.com/pytorch/fairseq
Model Base Model #Parameters Training Corpus Corpus Size
fairseq-fconv (Dauphin et al., 2017) ConvNet 324M WikiText-103 103M Words
Transformer-XL (large) (Dai et al., 2019) Transformer 257M WikiText-103 103M Words
ELMo (original) (Peters et al., 2018a) BiLSTM 93.6M Google Billion Word 800M Words
ELMo 5.5B (Peters et al., 2018a) BiLSTM 93.6M Wikipedia (en) & WMT 2008-2012 5.5B Words
BERT (base) (Devlin et al., 2018a) Transformer 110M Wikipedia (en) & BookCorpus 3.3B Words
BERT (large) (Devlin et al., 2018a) Transformer 340M Wikipedia (en) & BookCorpus 3.3B Words

Table 1: Language models considered in this study.

Transformer-XL: Dai et al. (2019) introduce a large-scale language model based on the Transformer (Vaswani et al., 2017). Transformer-XL can take into account a longer history by caching previous outputs and by using relative instead of absolute positional encoding. It achieves a test perplexity of 18.3 on the WikiText-103 corpus.

2.2 Bidirectional "Language Models"2

So far, we have looked at language models that predict the next word given a history of words. However, in many downstream applications we mostly care about having access to contextual representations of words, i.e., word representations that are a function of the entire context of a unit of text such as a sentence or paragraph, and not only conditioned on previous words. Formally, given an input sequence w = [w_1, w_2, ..., w_N] and a position 1 ≤ i ≤ N, we want to estimate p(w_i) = p(w_i | w_1, ..., w_{i-1}, w_{i+1}, ..., w_N) using the left and right context of that word.

ELMo: To estimate this probability, Peters et al. (2018a) propose running a forward and backward LSTM (Hochreiter and Schmidhuber, 1997), resulting in h→_i and h←_i, which consequently are used to calculate a forward and backward language model log-likelihood. Their model, ELMo, uses multiple layers of LSTMs and it has been pretrained on the Google Billion Word dataset. Another version of the model, ELMo 5.5B, has been trained on the English Wikipedia and monolingual news crawl data from WMT 2008-2012.

BERT: Instead of a standard language model objective, Devlin et al. (2018a) propose to sample positions in the input sequence randomly and to learn to fill the word at the masked position. To this end, they employ a Transformer architecture and train it on the BookCorpus (Zhu et al., 2015) as well as a crawl of English Wikipedia. In addition to this pseudo language model objective, they use an auxiliary binary classification objective to predict whether a particular sentence follows the given sequence of words.

2 Contextual representation models (Tenney et al., 2019) might be a better name, but we keep calling them language models for simplicity.
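As a concrete illustration of querying a masked language model with a cloze statement of the kind used throughout this paper, the sketch below uses the Hugging Face transformers library (a later release of the PyTorch-Transformers library the authors acknowledge). The model name is an assumption, and the sketch omits the restriction to a unified candidate vocabulary described in Section 4.2, so it is a simplified approximation rather than the paper's exact setup.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load a pretrained, case-sensitive BERT without any fine-tuning.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

# Cloze statement with a single masked token.
text = f"Dante was born in {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Rank the vocabulary for the masked position and show the top 5 predictions.
top = torch.topk(logits[0, mask_index], k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```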
3 Related Work

Many studies have investigated pretrained word representations, sentence representations, and language models. Existing work focuses on understanding linguistic and semantic properties of word representations or how well pretrained sentence representations and language models transfer linguistic knowledge to downstream tasks. In contrast, our investigation seeks to answer to what extent pretrained language models store factual and commonsense knowledge by comparing them with symbolic knowledge bases populated by traditional relation extraction approaches.

Baroni et al. (2014) present a systematic comparative analysis between neural word representation methods and more traditional count-based distributional semantic methods on lexical semantics tasks like semantic relatedness and concept categorization. They find that neural word representations outperform count-based distributional methods on the majority of the considered tasks. Hill et al. (2015) investigate to what degree word representations capture semantic meaning as measured by similarity between word pairs.

Marvin and Linzen (2018) assess the grammaticality of pretrained language models. Their dataset consists of sentence pairs with a grammatical and an ungrammatical sentence. While a good language model should assign higher probability to the grammatical sentence, they find that LSTMs do not learn syntax well.

Another line of work investigates the ability of pretrained sentence and language models to transfer knowledge to downstream natural language understanding tasks (Wang et al., 2018). While such an analysis sheds light on the transfer-learning abilities of pretrained models for understanding short pieces of text, it provides little insight into whether these models can compete with traditional approaches to representing knowledge like symbolic knowledge bases.
More recently, McCoy et al. (2019) found that for natural language inference, a model based on BERT learns to rely heavily on fallible syntactic heuristics instead of a deeper understanding of the natural language input. Peters et al. (2018b) found that lower layers in ELMo specialize on local syntactic relationships, while higher layers can learn to model long-range relationships. Similarly, Goldberg (2019) found that BERT captures English syntactic phenomena remarkably well. Tenney et al. (2019) investigate to what extent language models encode sentence structure for different syntactic and semantic phenomena and found that they excel for the former but only provide small improvements for tasks that fall into the latter category. While this provides insights into the linguistic knowledge of language models, it does not provide insights into their factual and commonsense knowledge.

Radford et al. (2018) introduce a pretrained language model based on the Transformer which they termed generative pretraining (GPTv1). The first version of GPT (Radford et al., 2018) has been trained on the Book Corpus (Zhu et al., 2015) containing 7000 books. The closest to our investigation is the work by Radford et al. (2019), which introduces GPTv2 and investigates how well their language model does zero-shot transfer to a range of downstream tasks. They find that GPTv2 achieves an F1 of 55 for answering questions in CoQA (Reddy et al., 2018) and 4.1% accuracy on the Natural Questions dataset (Kwiatkowski et al., 2019), in both cases without making use of annotated question-answer pairs or an information retrieval step. While these results are encouraging and hint at the ability of very large pretrained language models to memorize factual knowledge, the large GPTv2 model has not been made public and the publicly available small version achieves less than 1% on Natural Questions (5.3 times worse than the large model). Thus, we decided to not include GPTv2 in our study. Similarly, we do not include GPTv1 in this study as it uses a limited lower-cased vocabulary, making it incompatible with the way we assess the other language models.

4 The LAMA Probe

We introduce the LAMA (LAnguage Model Analysis) probe to test the factual and commonsense knowledge in language models. It provides a set of knowledge sources which are composed of a corpus of facts. Facts are either subject-relation-object triples or question-answer pairs. Each fact is converted into a cloze statement which is used to query the language model for a missing token. We evaluate each model based on how highly it ranks the ground truth token against every other word in a fixed candidate vocabulary. This is similar to ranking-based metrics from the knowledge base completion literature (Bordes et al., 2013; Nickel et al., 2016). Our assumption is that models which rank ground truth tokens high for these cloze statements have more factual knowledge. We discuss each step in detail next and provide considerations on the probe below.

4.1 Knowledge Sources

To assess the different language models in Section 2, we cover a variety of sources of factual and commonsense knowledge. For each source, we describe the origin of fact triples (or question-answer pairs), how we transform them into cloze templates, and to what extent aligned texts exist in Wikipedia that are known to express a particular fact. We use the latter information in supervised baselines that extract knowledge representations directly from the aligned text.

4.1.1 Google-RE

The Google-RE corpus3 contains ∼60K facts manually extracted from Wikipedia. It covers five relations, but we consider only three of them, namely "place of birth", "date of birth" and "place of death". We exclude the other two because they contain mainly multi-token objects that are not supported in our evaluation. We manually define a template for each considered relation, e.g., "[S] was born in [O]" for "place of birth". Each fact in the Google-RE dataset is, by design, manually aligned to a short piece of Wikipedia text supporting it.

3 https://code.google.com/archive/p/relation-extraction-corpus/
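As an illustration of the template-based conversion just described, the sketch below turns a (subject, relation, object) triple into a cloze statement by instantiating a relation template. The "[S] was born in [O]" template is the one given above; the helper name, the template dictionary, and the "[MASK]" placeholder string are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical helper: instantiate a manually defined relation template as a cloze statement.
TEMPLATES = {
    "place_of_birth": "[S] was born in [O]",  # template quoted in the paper; other relations get their own templates
}

def to_cloze(subject: str, relation: str, mask_token: str = "[MASK]") -> str:
    """Fill the subject slot and mask the object slot of the relation template."""
    template = TEMPLATES[relation]
    return template.replace("[S]", subject).replace("[O]", mask_token) + "."

print(to_cloze("Dante", "place_of_birth"))  # -> "Dante was born in [MASK]."
```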
4.1.2 T-REx

The T-REx knowledge source is a subset of Wikidata triples. It is derived from the T-REx dataset (Elsahar et al., 2018) and is much larger than Google-RE, with a broader set of relations.
We consider 41 Wikidata relations and subsample at most 1000 facts per relation. As with the Google-RE corpus, we manually define a template for each relation (see Table 3 for some examples). In contrast to the Google-RE knowledge source, T-REx facts were automatically aligned to Wikipedia and hence this alignment can be noisy. However, Elsahar et al. (2018) report an accuracy of 97.8% for the alignment technique over a test set.

4.1.3 ConceptNet

ConceptNet (Speer and Havasi, 2012) is a multilingual knowledge base, initially built on top of Open Mind Common Sense (OMCS) sentences. OMCS represents commonsense relationships between words and/or phrases. We consider facts from the English part of ConceptNet that have single-token objects covering 16 relations. For these ConceptNet triples, we find the OMCS sentence that contains both the subject and the object. We then mask the object within the sentence and use the sentence as template for querying language models. If there are several sentences for a triple, we pick one at random. Note that for this knowledge source there is no explicit alignment of facts to Wikipedia sentences.

4.1.4 SQuAD

SQuAD (Rajpurkar et al., 2016) is a popular question answering dataset. We select a subset of 305 context-insensitive questions from the SQuAD development set with single token answers. We manually create cloze-style questions from these questions, e.g., rewriting "Who developed the theory of relativity?" as "The theory of relativity was developed by ___". For each question and answer pair, we know that the corresponding fact is expressed in Wikipedia since this is how SQuAD was created.

4.2 Models

We consider the following pretrained case-sensitive language models in our study (see Table 1): fairseq-fconv (Fs), Transformer-XL large (Txl), ELMo original (Eb), ELMo 5.5B (E5B), BERT-base (Bb) and BERT-large (Bl). We use the natural way of generating tokens for each model by following the definition of the training objective function.

Assume we want to compute the generation for the token at position t. For unidirectional language models, we use the network output (h_{t-1}) just before the token to produce the output layer softmax. For ELMo we consider the output just before (h→_{t-1}) for the forward direction and just after (h←_{t+1}) for the backward direction. Following the loss definition in (Peters et al., 2018a), we average forward and backward probabilities from the corresponding softmax layers. For BERT, we mask the token at position t, and we feed the output vector corresponding to the masked token (h_t) into the softmax layer. To allow a fair comparison, we let models generate over a unified vocabulary, which is the intersection of the vocabularies for all considered models (∼21K case-sensitive tokens).
4.3 Baselines

To compare language models to canonical ways of using off-the-shelf systems for extracting symbolic knowledge and answering questions, we consider the following baselines.

Freq: For a subject and relation pair, this baseline ranks words based on how frequently they appear as objects for the given relation in the test data. It indicates the upper bound performance of a model that always predicts the same objects for a particular relation.

RE: For the relation-based knowledge sources, we consider the pretrained Relation Extraction (RE) model of Sorokin and Gurevych (2017). This model was trained on a subcorpus of Wikipedia annotated with Wikidata relations. It extracts relation triples from a given sentence using an LSTM-based encoder and an attention mechanism. Based on the alignment information from the knowledge sources, we provide the relation extractor with the sentences known to express the test facts. Using these datasets, RE constructs a knowledge graph of triples. At test time, we query this graph by finding the subject entity and then rank all objects in the correct relation based on the confidence scores returned by RE. We consider two versions of this procedure that differ in how the entity linking is implemented: REn makes use of a naïve entity linking solution based on exact string matching, while REo uses an oracle for entity linking in addition to string matching. In other words, assume we query for the object o of a test subject-relation fact (s, r, o) expressed in a sentence x. If RE has extracted any triple (s′, r, o′) from that sentence x, s′ will be linked to s and o′ to o. In practice, this means RE can return the correct solution o if any relation instance of the right type was extracted from x, regardless of whether it has a wrong subject or object.
DrQA: Chen et al. (2017) introduce DrQA, a popular system for open-domain question answering. DrQA predicts answers to natural language questions using a two step pipeline. First, a TF/IDF information retrieval step is used to find relevant articles from a large store of documents (e.g. Wikipedia). On the retrieved top k articles, a neural reading comprehension model then extracts answers. To avoid giving the language models a competitive advantage, we constrain the predictions of DrQA to single-token answers.

4.4 Metrics

We consider rank-based metrics and compute results per relation along with mean values across all relations. To account for multiple valid objects for a subject-relation pair (i.e., for N-M relations), we follow Bordes et al. (2013) and, when ranking at test time, remove from the candidates all valid objects in the training data other than the one we test. We use the mean precision at k (P@k). For a given fact, this value is 1 if the object is ranked among the top k results, and 0 otherwise.
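A sketch of this per-fact metric (not the paper's evaluation code; the helper and variable names are hypothetical) is given below. The per-fact values are then averaged per relation and across relations, as described above.

```python
def precision_at_k(ranked_candidates, gold, other_valid_objects, k=1):
    """P@k for one fact: filter out other valid objects (Bordes et al., 2013)
    before checking whether the gold object appears in the top k."""
    filtered = [c for c in ranked_candidates if c == gold or c not in other_valid_objects]
    return 1.0 if gold in filtered[:k] else 0.0

# Toy usage: "Uganda" is the tested object; "Tanzania" is another valid object for the
# same subject-relation pair and is removed from the ranking before computing P@1.
ranked = ["Tanzania", "Uganda", "China"]
print(precision_at_k(ranked, "Uganda", {"Tanzania"}, k=1))  # -> 1.0
```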
4.5 Considerations

There are several important design decisions we made when creating the LAMA probe. Below we give more detailed justifications for these decisions.

Manually Defined Templates: For each relation we manually define a template that queries for the object slot in that relation. One can expect that the choice of templates has an impact on the results, and this is indeed the case: for some relations we find both worse and better ways to query for the same information (with respect to a given model) by using an alternate template. We argue that this means we are measuring a lower bound for what language models know. We make this argument by analogy with traditional knowledge bases: they only have a single way of querying knowledge for a specific relation, namely by using the relation id of that relation, and this way is used to measure their accuracy. For example, if the relation ID is works-For and the user asks for is-working-for, the accuracy of the KG would be 0.

Single Token: We only consider single token objects as our prediction targets. The reason we include this limitation is that multi-token decoding adds a number of additional tuneable parameters (beam size, candidate scoring weights, length normalization, n-gram repetition penalties, etc.) that obscure the knowledge we are trying to measure. Moreover, well-calibrated multi-token generation is still an active research area, particularly for bidirectional models (see e.g. Welleck et al. (2019)).

Object Slots: We choose to only query object slots in triples, as opposed to subject or relation slots. By including reverse relations (e.g. contains and contained-by) we can also query subject slots. We do not query relation slots for two reasons. First, surface form realisations of relations will span several tokens, and as we discussed above, this poses a technical challenge that is not in the scope of this work. Second, even if we could easily predict multi-token phrases, relations can generally be expressed with many different wordings, making it unclear what the gold standard pattern for a relation should be, and how to measure accuracy in this context.

Intersection of Vocabularies: The models that we consider are trained with different vocabularies. For instance, ELMo uses a list of ∼800K tokens while BERT considers only ∼30K tokens. The size of the vocabulary can influence the performance of a model for the LAMA probe. Specifically, the larger the vocabulary, the harder it is to rank the gold token at the top. For this reason we considered a common vocabulary of ∼21K case-sensitive tokens obtained from the intersection of the vocabularies of all considered models. To allow a fair comparison, we let every model rank only tokens in this joint vocabulary.

5 Results

We summarize the main results in Table 2, which shows the mean precision at one (P@1) for the different models across the set of corpora considered. In the remainder of this section, we discuss the results for each corpus in detail.

Google-RE: We query the LMs using a standard cloze template for each relation. The base and large versions of BERT both outperform all other models by a substantial margin. Furthermore, they obtain a 2.2 and 2.9 point average accuracy improvement, respectively, over the oracle-based RE baseline. This is particularly surprising given that with the gold-aligned Google-RE source we know for certain that the oracle RE baseline has seen at least one sentence expressing each test fact. Moreover, the RE baseline was given substantial help through an entity linking oracle.
Corpus     | Relation    | #Facts | #Rel | Freq  | DrQA | REn | REo  | Fs   | Txl  | Eb   | E5B  | Bb   | Bl
Google-RE  | birth-place | 2937   | 1    | 4.6   | -    | 3.5 | 13.8 | 4.4  | 2.7  | 5.5  | 7.5  | 14.9 | 16.1
Google-RE  | birth-date  | 1825   | 1    | 1.9   | -    | 0.0 | 1.9  | 0.3  | 1.1  | 0.1  | 0.1  | 1.5  | 1.4
Google-RE  | death-place | 765    | 1    | 6.8   | -    | 0.1 | 7.2  | 3.0  | 0.9  | 0.3  | 1.3  | 13.1 | 14.0
Google-RE  | Total       | 5527   | 3    | 4.4   | -    | 1.2 | 7.6  | 2.6  | 1.6  | 2.0  | 3.0  | 9.8  | 10.5
T-REx      | 1-1         | 937    | 2    | 1.78  | -    | 0.6 | 10.0 | 17.0 | 36.5 | 10.1 | 13.1 | 68.0 | 74.5
T-REx      | N-1         | 20006  | 23   | 23.85 | -    | 5.4 | 33.8 | 6.1  | 18.0 | 3.6  | 6.5  | 32.4 | 34.2
T-REx      | N-M         | 13096  | 16   | 21.95 | -    | 7.7 | 36.7 | 12.0 | 16.5 | 5.7  | 7.4  | 24.7 | 24.3
T-REx      | Total       | 34039  | 41   | 22.03 | -    | 6.1 | 33.8 | 8.9  | 18.3 | 4.7  | 7.1  | 31.1 | 32.3
ConceptNet | Total       | 11458  | 16   | 4.8   | -    | -   | -    | 3.6  | 5.7  | 6.1  | 6.2  | 15.6 | 19.2
SQuAD      | Total       | 305    | -    | -     | 37.5 | -   | -    | 3.6  | 3.9  | 1.6  | 4.3  | 14.1 | 17.4

Table 2: Mean precision at one (P@1) for a frequency baseline (Freq), DrQA, a relation extraction system with naïve entity linking (REn), with oracle entity linking (REo), fairseq-fconv (Fs), Transformer-XL large (Txl), ELMo original (Eb), ELMo 5.5B (E5B), BERT-base (Bb) and BERT-large (Bl) across the set of evaluation corpora.

It is worth pointing out that while BERT-large does better, this does not mean it does so for the right reasons. Although the aligned Google-RE sentences are likely in its training set (as they are part of Wikipedia and BERT has been trained on Wikipedia), it might not "understand" them to produce these results. Instead, it could have learned associations of objects with subjects from co-occurrence patterns.

T-REx: The knowledge source derived from Google-RE contains relatively few facts and only three relations. Hence, we perform experiments on the larger set of facts and relations in T-REx. We find that results are generally consistent with Google-RE. Again, the performance of BERT in retrieving factual knowledge is close to the performance obtained by automatically building a knowledge base with an off-the-shelf relation extraction system and oracle-based entity linking. Broken down by relation type, the performance of BERT is very high for 1-to-1 relations (e.g., capital of) and low for N-to-M relations.

Note that a downstream model could learn to make use of knowledge in the output representations of a language model even if the correct answer is not ranked first but high enough (i.e. a hint about the correct answer can be extracted from the output representation). Figure 2 shows the mean P@k curves for the considered models. For BERT, the correct object is ranked among the top ten in around 60% of the cases and among the top 100 in 80% of the cases.

[Figure 2: Mean P@k curve for T-REx varying k. Base-10 log scale for the X axis.]

To further investigate why BERT achieves such strong results, we compute the Pearson correlation coefficient between the P@1 and a set of metrics that we report in Figure 3. We notice, for instance, that the number of times an object is mentioned in the training data positively correlates with performance, while the same is not true for the subject of a relation. Furthermore, the log probability of a prediction is strongly positively correlated with P@1. Thus, when BERT has a high confidence in its prediction, it is often correct. Performance is also positively correlated with the cosine similarity between subject and object vectors, and slightly with the number of tokens in the subject.
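To illustrate the correlation analysis, here is a minimal sketch of computing the Pearson correlation coefficient between per-fact P@1 outcomes and one of the metrics. The arrays are made-up toy values, not the paper's data.

```python
import numpy as np

# Hypothetical per-fact values: P@1 outcomes and how often the object is mentioned
# in the training corpus (both arrays are invented for illustration only).
p_at_1 = np.array([1, 0, 1, 1, 0, 1, 0, 1], dtype=float)
object_mentions = np.array([120, 3, 45, 200, 1, 88, 5, 150], dtype=float)

# Pearson correlation coefficient between P@1 and the object-mention count.
r = np.corrcoef(p_at_1, object_mentions)[0, 1]
print(f"Pearson r = {r:.2f}")
```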
[Figure 3: Pearson correlation coefficient for the P@1 of the BERT-large model on T-REx and a set of metrics: SM and OM refer to the number of times a subject and an object are mentioned in the BERT training corpus4 respectively; LPFP is the log probability score associated with the first prediction; SOCS is the cosine similarity between subject and object vectors (we use spaCy5); ST and SWP are the number of tokens in the subject with a standard tokenization and the BERT WordPiece tokenization respectively.]

Table 3 shows randomly picked examples for the generation of BERT-large for cloze template queries. We find that BERT-large generally predicts objects of the correct type, even when the predicted object itself is not correct.

To understand how the performance of a pretrained language model varies with different ways of querying for a particular fact, we analyze a maximum of 100 random facts per relation for which we randomly select 10 aligned sentences in Wikipedia from T-REx.6 In each of the sentences, we mask the object of the fact, and ask the model to predict it. For several of our language models this also tests their ability to memorize and recall sentences from the training data, since the models have been trained on Wikipedia (see Table 1).

[Figure 4: Average rank distribution for 10 different mentions of 100 random facts per relation in T-REx. ELMo 5.5B and both variants of BERT are least sensitive to the framing of the query but also are the most likely to have seen the query sentence during training.]

Figure 4 shows the average distribution of the rank for ten queries per fact. The two BERT models and ELMo 5.5B exhibit the lowest variability while ranking the correct object close to the top on average. Surprisingly, the performance of ELMo original is not far from BERT, even though this model did not see Wikipedia during training. Fairseq-fconv and Transformer-XL experience a higher variability in their predictions. Note that BERT and ELMo 5.5B have been trained on a larger portion of Wikipedia than fairseq-fconv and Transformer-XL and may have seen more sentences containing the test queries during training.

4 The original training corpus is not available; we created our version using the same sources.
5 https://spacy.io
6 We exclude all facts with less than 10 alignments.
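A minimal sketch of the per-fact rank statistics behind this analysis is given below. The rankings are toy values and the helper name is hypothetical; the paper aggregates over 10 aligned sentences and up to 100 facts per relation.

```python
import numpy as np

def rank_of_gold(ranked_tokens, gold):
    """1-based rank of the gold object in a model's ranking for one query sentence."""
    return ranked_tokens.index(gold) + 1

# Hypothetical rankings for the same fact queried with three differently phrased sentences.
rankings = [
    ["Florence", "Rome", "Naples"],
    ["Rome", "Florence", "Naples"],
    ["Florence", "Naples", "Rome"],
]
ranks = np.array([rank_of_gold(r, "Florence") for r in rankings])
print(ranks.mean(), ranks.std())  # average rank and its variability across phrasings
```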
ConceptNet: The results on the ConceptNet corpus are in line with those reported for retrieving factual knowledge in Google-RE and T-REx. The BERT-large model consistently achieves the best performance, and it is able to retrieve commonsense knowledge at a similar level to factual knowledge. The lower half of Table 3 shows generations by BERT-large for randomly sampled examples. Some of the concepts generated by the language models are surprisingly reasonable in addition to being syntactically correct.

SQuAD: Next we evaluate our system on open-domain cloze-style question answering and compare against the supervised DrQA model. Table 2 shows a performance gap between BERT-large and the DrQA open-domain QA system on our cloze SQuAD task. Again, note that the pretrained language model is completely unsupervised, it is not fine-tuned, and it has no access to a dedicated information retrieval system. Moreover, when comparing DrQA and BERT-large in terms of P@10, we find that the gap is remarkably small (57.1 for BERT-large and 63.5 for DrQA).
Relation Query Answer Generation

T-REx
P19 Francesco Bartolomeo Conti was born in . Florence Rome [-1.8] , Florence [-1.8] , Naples [-1.9] , Milan [-2.4] , Bologna [-2.5]
P20 Adolphe Adam died in . Paris Paris [-0.5] , London [-3.5] , Vienna [-3.6] , Berlin [-3.8] , Brussels [-4.0]
P279 English bulldog is a subclass of . dog dogs [-0.3] , breeds [-2.2] , dog [-2.4] , cattle [-4.3] , sheep [-4.5]
P37 The official language of Mauritius is . English English [-0.6] , French [-0.9] , Arabic [-6.2] , Tamil [-6.7] , Malayalam [-7.0]
P413 Patrick Oboya plays in position. midfielder centre [-2.0] , center [-2.2] , midfielder [-2.4] , forward [-2.4] , midfield [-2.7]
P138 Hamburg Airport is named after . Hamburg Hess [-7.0] , Hermann [-7.1] , Schmidt [-7.1] , Hamburg [-7.5] , Ludwig [-7.5]
P364 The original language of Mon oncle Benjamin is . French French [-0.2] , Breton [-3.3] , English [-3.8] , Dutch [-4.2] , German [-4.9]
P54 Dani Alves plays with . Barcelona Santos [-2.4] , Porto [-2.5] , Sporting [-3.1] , Brazil [-3.3] , Portugal [-3.7]
P106 Paul Toungui is a by profession . politician lawyer [-1.1] , journalist [-2.4] , teacher [-2.7] , doctor [-3.0] , physician [-3.7]
P527 Sodium sulfide consists of . sodium water [-1.2] , sulfur [-1.7] , sodium [-2.5] , zinc [-2.8] , salt [-2.9]
P102 Gordon Scholes is a member of the political party. Labor Labour [-1.3] , Conservative [-1.6] , Green [-2.4] , Liberal [-2.9] , Labor [-2.9]
P530 Kenya maintains diplomatic relations with . Uganda India [-3.0] , Uganda [-3.2] , Tanzania [-3.5] , China [-3.6] , Pakistan [-3.6]
P176 iPod Touch is produced by . Apple Apple [-1.6] , Nokia [-1.7] , Sony [-2.0] , Samsung [-2.6] , Intel [-3.1]
P30 Bailey Peninsula is located in . Antarctica Antarctica [-1.4] , Bermuda [-2.2] , Newfoundland [-2.5] , Alaska [-2.7] , Canada [-3.1]
P178 JDK is developed by . Oracle IBM [-2.0] , Intel [-2.3] , Microsoft [-2.5] , HP [-3.4] , Nokia [-3.5]
P1412 Carl III used to communicate in . Swedish German [-1.6] , Latin [-1.9] , French [-2.4] , English [-3.0] , Spanish [-3.0]
P17 Sunshine Coast, British Columbia is located in . Canada Canada [-1.2] , Alberta [-2.8] , Yukon [-2.9] , Labrador [-3.4] , Victoria [-3.4]
P39 Pope Clement VII has the position of . pope cardinal [-2.4] , Pope [-2.5] , pope [-2.6] , President [-3.1] , Chancellor [-3.2]
P264 Joe Cocker is represented by music label . Capitol EMI [-2.6] , BMG [-2.6] , Universal [-2.8] , Capitol [-3.2] , Columbia [-3.3]
P276 London Jazz Festival is located in . London London [-0.3] , Greenwich [-3.2] , Chelsea [-4.0] , Camden [-4.6] , Stratford [-4.8]
P127 Border TV is owned by . ITV Sky [-3.1] , ITV [-3.3] , Global [-3.4] , Frontier [-4.1] , Disney [-4.3]
P103 The native language of Mammootty is . Malayalam Malayalam [-0.2] , Tamil [-2.1] , Telugu [-4.8] , English [-5.2] , Hindi [-5.6]
P495 The Sharon Cuneta Show was created in . Philippines Manila [-3.2] , Philippines [-3.6] , February [-3.7] , December [-3.8] , Argentina [-4.0]

ConceptNet
AtLocation You are likely to find a overflow in a . drain sewer [-3.1] , canal [-3.2] , toilet [-3.3] , stream [-3.6] , drain [-3.6]
CapableOf Ravens can . fly fly [-1.5] , fight [-1.8] , kill [-2.2] , die [-3.2] , hunt [-3.4]
CausesDesire Joke would make you want to . laugh cry [-1.7] , die [-1.7] , laugh [-2.0] , vomit [-2.6] , scream [-2.6]
Causes Sometimes virus causes . infection disease [-1.2] , cancer [-2.0] , infection [-2.6] , plague [-3.3] , fever [-3.4]
HasA Birds have . feathers wings [-1.8] , nests [-3.1] , feathers [-3.2] , died [-3.7] , eggs [-3.9]
HasPrerequisite Typing requires . speed patience [-3.5] , precision [-3.6] , registration [-3.8] , accuracy [-4.0] , speed [-4.1]
HasProperty Time is . finite short [-1.7] , passing [-1.8] , precious [-2.9] , irrelevant [-3.2] , gone [-4.0]
MotivatedByGoal You would celebrate because you are . alive happy [-2.4] , human [-3.3] , alive [-3.3] , young [-3.6] , free [-3.9]
ReceivesAction Skills can be . taught acquired [-2.5] , useful [-2.5] , learned [-2.8] , combined [-3.9] , varied [-3.9]
UsedFor A pond is for . fish swimming [-1.3] , fishing [-1.4] , bathing [-2.0] , fish [-2.8] , recreation [-3.1]

Table 3: Examples of generation for BERT-large. The last column reports the top five tokens generated together
with the associated log probability (in square brackets).

6 Discussion and Conclusion

We presented a systematic analysis of the factual and commonsense knowledge in publicly available pretrained language models as is and found that BERT-large is able to recall such knowledge better than its competitors and at a level remarkably competitive with non-neural and supervised alternatives. Note that we did not compare the ability of the corresponding architectures and objectives to capture knowledge in a given body of text but rather focused on the knowledge present in the weights of existing pretrained models that are being used as starting points for many researchers' work. Understanding which aspects of data our commonly-used models and learning algorithms are capturing is a crucial field of research and this paper complements the many studies focused on the learned linguistic properties of the data.

We found that it is non-trivial to extract a knowledge base from text that performs on par with directly using pretrained BERT-large. This is despite providing our relation extraction baseline with only data that is likely expressing target facts, thus reducing potential for false negatives, as well as using a generous entity-linking oracle. We suspected BERT might have an advantage due to the larger amount of data it has processed, so we added Wikitext-103 as additional data to the relation extraction system and observed no significant change in performance. This suggests that while relation extraction performance might be difficult to improve with more data, language models trained on ever growing corpora might become a viable alternative to traditional knowledge bases extracted from text in the future.

In addition to testing future pretrained language models using the LAMA probe, we are interested in quantifying the variance of recalling factual knowledge with respect to varying natural language templates. Moreover, assessing multi-token answers remains an open challenge for our evaluation setup.

Acknowledgments

We would like to thank the reviewers for their thoughtful comments and efforts towards improving our manuscript. In addition, we would like to acknowledge three frameworks that were used in our experiments: AllenNLP7, Fairseq8 and the Hugging Face PyTorch-Transformers9 library.

7 https://github.com/allenai/allennlp
8 https://github.com/pytorch/fairseq
9 https://github.com/huggingface/pytorch-transformers

References
Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Pushmeet Kohli, and Edward Grefenstette. 2019. Learning to understand goal specifications by modelling reward. In International Conference on Learning Representations (ICLR).

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL 2014, pages 238–247.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 2787–2795.

S. R. K. Branavan, David Silver, and Regina Barzilay. 2011. Learning to win by reading manuals in a monte-carlo framework. In Proceedings of ACL-HLT 2011, pages 268–277.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. CoRR, abs/1704.00051.

Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. 2018. Babyai: First steps towards grounded language learning with a human in the loop. CoRR, abs/1810.08272.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of ICML 2017, pages 933–941.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018a. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018b. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs].

Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-rex: A large scale alignment of natural language with knowledge base triples. In Proceedings of LREC 2018.

Yoav Goldberg. 2019. Assessing bert's syntactic abilities. CoRR, abs/1901.05287.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Rhinehart, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, et al. 2019. Natural questions: a benchmark for question answering research.

Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. 2019. A survey of reinforcement learning informed by natural language. In Proceedings of IJCAI 2019.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of EMNLP 2018, pages 1192–1202.

R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. CoRR, abs/1707.05589.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. CoRR, abs/1609.07843.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pages 234–239.

Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, pages 2227–2237.

Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of EMNLP 2018, pages 1499–1509.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP 2016, pages 2383–2392.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. Coqa: A conversational question answering challenge. CoRR, abs/1808.07042.

Daniil Sorokin and Iryna Gurevych. 2017. Context-aware representations for knowledge base relation extraction. In Proceedings of EMNLP 2017, pages 1784–1789.

Robert Speer and Catherine Havasi. 2012. Representing general relational knowledge in conceptnet 5. In LREC, pages 3679–3686.

Mihai Surdeanu and Heng Ji. 2014. Overview of the English Slot Filling Track at the TAC2014 Knowledge Base Population Evaluation.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of NAACL-HLT 2019, pages 4149–4158.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations (ICLR).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 6000–6010.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of BlackboxNLP@EMNLP 2018, pages 353–355.

Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. 2019. Non-monotonic sequential text generation. arXiv preprint arXiv:1902.02192.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2018. From recognition to cognition: Visual commonsense reasoning. CoRR, abs/1811.10830.

Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of ICCV 2015, pages 19–27.
