
EMNLP 2021 REBEL Camera Ready


REBEL: Relation Extraction By End-to-end Language generation

Pere-Lluís Huguet Cabot
Sapienza University of Rome & Babelscape, Italy
huguetcabot@babelscape.com

Roberto Navigli
Sapienza University of Rome
navigli@diag.uniroma1.it

Abstract

Extracting relation triplets from raw text is a crucial task in Information Extraction, enabling multiple applications such as populating or validating knowledge bases, factchecking, and other downstream tasks. However, it usually involves multiple-step pipelines that propagate errors or are limited to a small number of relation types. To overcome these issues, we propose the use of autoregressive seq2seq models. Such models have previously been shown to perform well not only in language generation, but also in NLU tasks such as Entity Linking, thanks to their framing as seq2seq tasks. In this paper, we show how Relation Extraction can be simplified by expressing triplets as a sequence of text and we present REBEL, a seq2seq model based on BART that performs end-to-end relation extraction for more than 200 different relation types. We show our model’s flexibility by fine-tuning it on an array of Relation Extraction and Relation Classification benchmarks, with it attaining state-of-the-art performance in most of them.

1 Introduction

Extracting relational facts from text has been an ongoing part of Natural Language Processing. The ability to extract semantic relationships between entities from text can be used to go from unstructured raw text to structured data that can be leveraged in an array of downstream tasks and applications, such as the construction of Knowledge Bases.

Traditionally this task has been approached as a two-step problem. First, the entities are extracted from text as in Named Entity Recognition (NER). Second, Relation Classification (RC) checks whether there exists any pairwise relation between the extracted entities (Zeng et al., 2014; Zhang et al., 2017). However, identifying which entities truly share a relation can become a bottleneck, requiring additional steps such as negative sampling and expensive annotation procedures.

More recently, end-to-end approaches have been used to tackle both tasks simultaneously (Miwa and Sasaki, 2014; Pawar et al., 2017; Katiyar and Cardie, 2017; Eberts and Ulges, 2020). This task is usually referred to as Relation Extraction or End-to-End Relation Extraction (RE). In this scenario, a model is trained simultaneously on both objectives. Specific parts of the model can be assigned different tasks of the pipeline, such as NER, on the one hand, and classifying the relations between the predicted entities (RC), on the other. By training both tasks simultaneously, the model benefits from the information bias between the tasks as in multi-task setups (Caruana, 1998), improving performance on the end-to-end RE task.

Although successful, these models are often complex, with task-focused elements that need to be adapted to the number of relation or entity types, or they are not flexible enough to work for texts of different nature (sentence vs. document level) or domains. Moreover, they usually require long training times in order to be fine-tuned on new data.

In this paper, we present REBEL (Relation Extraction By End-to-end Language generation), an autoregressive approach that frames Relation Extraction as a seq2seq task, together with the REBEL dataset, a large-scale distantly supervised dataset, obtained by leveraging a Natural Language Inference model. Our approach provides some upsides over previous end-to-end approaches thanks to our adoption of a simple triplet decomposition into a text sequence. By pre-training an Encoder-Decoder Transformer (BART) using our new dataset, REBEL achieves state-of-the-art performance on an array of RE baselines within a few epochs of fine-tuning. Its simplicity makes it highly flexible to adapt to new domains or longer documents. As the same model weights are still utilized after the pre-training phase, there is no need to train model-specific components from scratch, making training more efficient.
Moreover, although it is devised for Relation Extraction, the same approach can be generalized to Relation Classification, achieving competitive results.

We make REBEL available¹ both as a standalone model that can extract more than 200 different relation types, and as a pre-trained RE model that can be easily fine-tuned on new RE and RC datasets. We also provide the REBEL dataset and the pipeline to extract high-quality RE datasets from any Wikipedia dump.

¹ https://github.com/babelscape/rebel

2 Related work

2.1 Relation Extraction

The term Relation Extraction is often used in the literature for different tasks and setups (Taillé et al., 2020). For clarity, we refer to Relation Extraction (RE) as the task of extracting triplets of relations between entities from raw text, with no given entity spans, usually also called end-to-end Relation Extraction. We refer to classifying the relation between two entities in a given context as Relation Classification (RC).

Early approaches tackled RE as a pipeline system, identifying the entities present in the text using Named Entity Recognition, and then classifying the relation, or lack thereof, between each pair of entities present in the text (RC). Therefore, early work made use of CNNs or LSTMs to exploit sentence-level semantics and classify the relations between two given entities (Zeng et al., 2014; Zhou et al., 2016). Current approaches to Relation Classification use Transformer models, with Yamada et al. (2020) being the current state of the art by enhancing BERT (Devlin et al., 2019) with entity-aware components.

Early end-to-end approaches using neural networks classified all word pairs present in the input text (Miwa and Sasaki, 2014; Pawar et al., 2017) using table representation, or table filling, re-framing the task into filling the slots of a table (the relations) where rows and columns are the words in the input. More recently, Wang and Lu (2020) used a similar table-based formulation, where the table is explicitly encoded using a table-sequence encoder.

Finally, there are pipeline systems that tackle both parts of Relation Extraction, NER and RC, by jointly training components that take advantage of the information shared between the tasks. In these setups, entities are first extracted as in NER using BILOU tags and then a biaffine classifier extracts their relations, sharing part of the encoders for both tasks. These range from LSTMs (Miwa and Bansal, 2016; Katiyar and Cardie, 2017) to CNNs (Adel and Schütze, 2017; Zheng et al., 2017) and, lately, Transformer-based architectures (Eberts and Ulges, 2020) that explicitly predict and encode entity spans instead of the BILOU approach used in NER.

All recent sentence-level RE models are based on Transformer models, such as BERT (Eberts and Ulges, 2020; Wang et al., 2020) or ALBERT (Lan et al., 2020; Wang and Lu, 2020). To tackle document-level RE, Eberts and Ulges (2021) use a pipeline approach jointly trained on a multi-task setup that leverages coreference resolution to operate at an entity level, rather than mentions.

While the aforementioned work highlights the relevance of Relation Extraction as a task, the lack of consistent baselines or a cohesive task definition has led to discrepancies in the use of datasets and the way models have been evaluated. Taillé et al. (2020) explain the different issues found so far, and also make an attempt to unify RE evaluation and perform a fair comparison between systems.

We will follow their guidelines and use strict evaluation, unless specified, for which a relation is considered correct only if the head and tail entity surface forms are correctly extracted (i.e., fully overlap with the annotation), as well as the relation and entity types (if available for the dataset).

2.2 Seq2seq and Relation Extraction

The pipeline and table filling methods described so far have proved to perform well on RE, but still face some challenges. They often assume at most one relation type between each entity pair, and multi-class approaches do not take other predictions into account. For instance, they could predict two “birth dates” for the same head entity, or predict relations that are incompatible together. Moreover, they require all possible entity pairs to be inferred, which can become computationally expensive.

Seq2seq approaches for RE (Zeng et al., 2018, 2020; Nayak and Ng, 2020) offer some off-the-shelf solutions to these problems. Decoding mechanisms can output the same entities multiple times, as well as conditioning future decoding on previous predictions, implicitly dealing with incompatible ones. However, as Zhang et al. (2020) discuss, they still pose some issues. The triplets need to
be linearized into a somewhat arbitrary sequential order, such as the alphabetical one. This issue is explored by Zeng et al. (2019), who use Reinforcement Learning to compute the extraction order for the triplets. Moreover, seq2seq approaches suffer from exposure bias, since at training time the prediction is always dependent on the gold-standard output. In Zhang et al. (2020) a tree-decoding approach mitigates these issues while still using an autoregressive seq2seq approach.

In the meantime, seq2seq Transformer models, such as BART (Lewis et al., 2020) or T5 (Raffel et al., 2020) have been used in NLU tasks such as Entity Linking (Cao et al., 2021), AMR parsing (Bevilacqua et al., 2021), Semantic Role Labeling (Blloshmi et al., 2021) or Word Sense Disambiguation (Bevilacqua et al., 2020) by reframing them as seq2seq tasks. Not only do they show strong performance, but they also showcase the flexibility of seq2seq models by not relying on predefined entity sets, but rather on the decoding mechanism, which can easily be extended to new or unseen entities.

For our model, we employ an Encoder-Decoder framework that can alleviate some of the previous issues seq2seq for RE has faced. While exposure bias can still occur, the attention mechanism enables long-distance dependencies as well as attending (or not) to the previously decoded output. Additionally, we devise a novel triplet linearization with a consistent triplet ordering that enables the model to leverage both the encoded input and the already decoded output.

3 REBEL

We tackle Relation Extraction and Classification as a generation task: we use an autoregressive model that outputs each triplet present in the input text. To this end, we employ BART-large (Lewis et al., 2020) as the base model.

In a translation task, teacher forcing leverages pairs of text in two languages by conditioning the decoded text on the input. At training time the encoder receives the text in one language, and the decoder receives the text in the other language, outputting the prediction for the next token at each position.

In our approach, we translate a raw input sentence containing entities, together with implicit relations between them, into a set of triplets that explicitly refer to those relations. Therefore, we need to express the triplets as a sequence of tokens to be decoded by the model. We design a reversible linearization using special tokens that enable the model to output the relations in the text in the form of triplets while minimizing the number of tokens that need to be decoded.

For REBEL, we have as input the text from the dataset and, as output, the linearized triplets. If x is our input sentence and y the result of linearizing the relations in x as explained in Section 3.1, the task for REBEL is to autoregressively generate y given x:

$$p_{\text{BART}}(y \mid x) = \prod_{i=1}^{\text{len}(y)} p_{\text{BART}}(y_i \mid y_{<i}, x)$$

By fine-tuning BART on such a task, using the Cross-Entropy loss as in Summarization or Machine Translation, we maximize the log-likelihood of generating the linearized triplets given the input text.

3.1 Triplets linearization

For RE, we want to express triplets as a sequence of tokens such that we can retrieve the original relations and minimize the number of tokens to be generated so as to make decoding more efficient. We introduce a set of new tokens, as markers, to achieve the aforementioned linearization. <triplet> marks the start of a new triplet with a new head entity, followed by the surface form of that entity in the input text. <subj> marks the end of the head entity and the start of the tail entity surface form. <obj> marks the end of the tail entity and the start of the relation between the head and tail entity, in its surface form. To obtain a consistent order in the decoded triplets, we sort the entities by their order of appearance in the input text and linearize the triplets following that order. Triplets will also be grouped by head entity. Therefore, the first triplet will be the one with the first appearing head entity and the following relation will be the one with the first appearing tail entity related to that head entity, followed by the rest of triplets with the same head entity. There is no need to specify the head entity each time, reducing the decoded text length. Once there are no more relations with that head entity, a new group of relations will start, with the second appearing head entity in the text, repeating the same process until there are no more triplets to be linearized. This mechanism is described in Algorithm 1.
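The grouping and ordering rules of Section 3.1 can be illustrated with a short Python sketch. This is not the authors' implementation: the function name `linearize`, the `(head, relation, tail)` tuple layout, and the use of `str.find` to approximate each entity's position in the text are assumptions made for illustration. Unlike Algorithm 1, the sketch only emits a `<triplet>` group for entities that head at least one relation.

```python
from collections import defaultdict


def linearize(text, triplets):
    """Linearize (head, relation, tail) triplets, grouped by head entity.

    Entities are ordered by their first occurrence in `text`, an assumed
    stand-in for annotated entity offsets.
    """
    def position(entity):
        idx = text.find(entity)
        # Entities that cannot be located are pushed to the end.
        return idx if idx >= 0 else len(text)

    by_head = defaultdict(list)
    for head, relation, tail in triplets:
        by_head[head].append((relation, tail))

    parts = []
    for head in sorted(by_head, key=position):
        # The head entity is written once per group.
        parts.append(f"<triplet> {head}")
        # Within a group, relations follow the order in which tails appear.
        for relation, tail in sorted(by_head[head], key=lambda rt: position(rt[1])):
            parts.append(f"<subj> {tail} <obj> {relation}")
    return " ".join(parts)


text = ('"This Must Be the Place" is a song by new wave band Talking Heads, '
        'released in November 1983 as the second single from its fifth album '
        '"Speaking in Tongues"')
triplets = [
    ("This Must Be the Place", "performer", "Talking Heads"),
    ("Talking Heads", "genre", "new wave"),
    ("This Must Be the Place", "part of", "Speaking in Tongues"),
    ("Speaking in Tongues", "performer", "Talking Heads"),
]
linearized = linearize(text, triplets)
print(linearized)
```

The input sentence and triplets are the ones used in Figure 1; the head entity "This Must Be the Place" heads two relations but is written only once in the output.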
Input text:
“This Must Be the Place” is a song by new wave band Talking Heads, released in November 1983 as the second single from its fifth album “Speaking in Tongues”

Triplets:
(This Must Be the Place, performer, Talking Heads)
(Talking Heads, genre, new wave)
(This Must Be the Place, part of, Speaking in Tongues)
(Speaking in Tongues, performer, Talking Heads)

Linearized triplets:
<triplet> This Must Be the Place <subj> Talking Heads <obj> performer <subj> Speaking in Tongues <obj> part of <triplet> Talking Heads <subj> new wave <obj> genre <triplet> Speaking in Tongues <subj> Talking Heads <obj> performer

Figure 1: Example of the triplet linearization process for REBEL.
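Because the special tokens delimit every field, the linearized sequence in Figure 1 can be mapped back to the original triplets. The following Python sketch shows one way this could be done; `parse_linearized` is a hypothetical helper, not the released REBEL code, and it assumes a well-formed sequence that uses only the three markers `<triplet>`, `<subj>` and `<obj>`.

```python
import re

MARKERS = ("<triplet>", "<subj>", "<obj>")


def parse_linearized(sequence):
    """Recover (head, relation, tail) triplets from a linearized string."""
    # Split on the markers while keeping them as separate items.
    pieces = re.split(r"(<triplet>|<subj>|<obj>)", sequence)
    triplets = []
    head = tail = marker = None
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        if piece in MARKERS:
            marker = piece
        elif marker == "<triplet>":
            head = piece  # start of a new group: remember the head entity
        elif marker == "<subj>":
            tail = piece  # tail entity of the relation that follows
        elif marker == "<obj>" and head is not None and tail is not None:
            triplets.append((head, piece, tail))
            tail = None
    return triplets


linearized = (
    "<triplet> This Must Be the Place <subj> Talking Heads <obj> performer "
    "<subj> Speaking in Tongues <obj> part of "
    "<triplet> Talking Heads <subj> new wave <obj> genre "
    "<triplet> Speaking in Tongues <subj> Talking Heads <obj> performer"
)
recovered = parse_linearized(linearized)
for triplet in recovered:
    print(triplet)
```

The head entity is carried over from the most recent `<triplet>` marker, which is why it needs to be decoded only once per group. For datasets with entity types, the same idea would apply with the type tokens in place of `<subj>` and `<obj>`, although that variant is not shown here.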

Algorithm 1: Transform a set of relations R into a text sequence

Result: lin_triplets with all triplets as a sequence of text.
Input:
    E = Entities;
    R = Relations;
    sort() Sorts by placement in input text;
Start:
    E = sort(E);
    lin_triplets = "";
    for e ∈ E do
        R(e) = relations with e as subject;
        R(e) = sort(R(e));
        lin_triplets += <triplet> + e;
        for r ∈ R(e) do
            o = E(e, r) object of relation r;
            lin_triplets += <subj> + o + <obj> + r;
        end
    end

Figure 1 shows an example of the linearization process for a list of relations and an input sentence. Notice how This Must Be the Place appears twice as a subject, but it is present only once in the output as a subject entity. The original triplets can easily be retrieved by taking the special tokens into account. In RE datasets, the entity types are also present in the triplets and need to be predicted by the model. In that case, we apply a modification of Algorithm 1 where instead of <subj> and <obj>, we add new tokens for each entity type, such as <per> or <org>, for person or organization, respectively, and use them in the same fashion, indicating the type of the entity they follow.

3.2 REBEL dataset

Autoregressive transformer models such as BART or T5 have been shown to perform well on different generative tasks such as translation or summarization, but they do require large amounts of data to be trained. On the other hand, end-to-end relation extraction datasets are scarce and often small.

In Elsahar et al. (2018) the T-REx dataset was created by devising a pipeline that extracts entities and relations from DBpedia abstracts to overcome this lack of big RE datasets. While the result is a large dataset, the quality of the annotation presents some issues. First, the use of a somewhat old entity linking tool (Daiber et al., 2013) leads to entities being wrongly disambiguated. Since the relations are extracted by using those entities, this leads to missing or faulty relations. Moreover, most of the relations are extracted by assuming that, if the two entities are present in the text, the relation is therefore entailed by this presence.

We overcome these issues by expanding upon their pipeline to create a large silver dataset, used as pre-training for REBEL. We use Wikipedia² abstracts, that is, the part of each Wikipedia page before the table of contents, extracted using wikiextractor (Attardi, 2015). Then, we link the entities present in the text as hyperlinks, together with dates and values, to Wikidata entities using wikimapper³. From this, we extract all the relations present between those entities in Wikidata. Our system can be used with any Wikipedia dump, in multiple languages, enabling light and quick extraction using a multi-core process and SQL to avoid memory issues with the Wikidata dump.

² Downloaded on 2021/02/01 from: https://dumps.wikimedia.org/enwiki/
³ https://pypi.org/project/wikimapper/

However, a relation in Wikidata does not necessarily mean that the relation is entailed within the text. Although in Elsahar et al. (2018) high reliability is claimed using this method, it has been shown to be noisy for frequent relations such as country or spouse, and we have found several related annotation issues. We utilize a pre-trained RoBERTa (Liu et al., 2019) Natural Language Inference (NLI) model⁴ to tackle this issue, and use its entailment prediction to filter those relations not entailed by the Wikipedia text. For each triplet, we input the text containing both entities from the Wikipedia abstract, and the triplet in their surface forms, subject + relation + object, separated by the <sep> token.

⁴ xlm-roberta-large-xnli

For the previous example and the triplet (Talking Heads, genre, new wave), we input: “This Must Be the Place” is a song by new wave band Talking Heads, released in November 1983 as the second single from its fifth album “Speaking in Tongues”. <sep> Talking Heads genre new wave. We keep those triplets for which the entailment prediction is higher than 0.75. This proves successful in creating cleaner data in preliminary experiments and removing noisy annotations. We create three random splits, with validation and test each being 5% of the total data.

While this data extraction pipeline may still keep some noise, or exclude some relations that are entailed by the text, it enables an automatic way of gathering millions of entities and relations as a silver dataset, sufficient for training our model. We name our RE dataset creation tool cRocoDiLe: Automatic Relation Extraction Dataset with NLI filtering, and we make it available here⁵.

⁵ https://github.com/Babelscape/crocodile

4 Experimental Setup

In this section, we describe the setup to train and evaluate REBEL for four different widely used RE datasets and one RC dataset. Statistics for all the datasets, including our pre-training dataset, can be found in Table 1.

Dataset         Entity Types   Relation Types   Train                   Validation          Test
CONLL04         4              5                1,290 (922)             343 (231)           422 (288)
NYT             3              24               94,222 (56,196)         8,489 (5,000)       8,616 (5,000)
DocRED          6              96               37,486 (3,008)          3,678 (300)         8,787 (700)
ADE             2              1                6,821 (4,272)           -                   -
Re-TACRED       17             40               58,465 (58,465)         19,584 (19,584)     13,418 (13,418)
REBEL (sent.)   -              220              878,555 (784,202)       48,514 (43,341)     48,852 (43,506)
REBEL (full)    -              1,146            9,282,837 (2,754,387)   513,270 (152,672)   515,186 (152,835)

Table 1: Dataset statistics. Number of triplets with number of instances in parentheses.

While the training objective is on the autoregressive task, we evaluate the model on RE, extracting all the triplets from the generated output, and evaluating using Recall, Precision, and micro-F1 based on the labeled triplets. For a triplet to be considered correct, the entities and the relation, as well as their types, have to be the same as the labeled ones (this is known as “strict” evaluation in RE) using the evaluation code from Taillé et al. (2020).

4.1 REBEL dataset

We create this dataset by matching Wikipedia hyperlinks with Wikidata entities as explained in Section 3.2. To pre-train our model, we use a sentence-level version of it, where only relations between entities present in each sentence are kept. We keep the 220 most frequent relations in the train split. We fine-tune REBEL (using BART-large as the base model) on the silver dataset for 6 epochs. We refer to the resulting model as REBEL_pre-training. While REBEL_pre-training is in and of itself capable of extracting relations subsuming about 220 types, we show that it also functions as a base step for downstream RE and RC tasks, which are fine-tuned on top of it.

4.2 CONLL04

CONLL04 (Roth and Yih, 2004) is composed of sentences from news articles, annotated with four entity types (person, organization, location and other) and five relation types (kill, work for, organization based in, live in and located in). To compare with previous work, we use the test split from Gupta et al. (2016), and the same validation set as Eberts and Ulges (2020), although we do not include the validation set at final training time.

For CONLL04 we expand REBEL to include entity types. As described in Section 3.1, we introduce a set of new tokens for each entity type. For CONLL04 these are <peop>, <org>, <loc>,
<other>. We fine-tune on top of REBEL for 30 epochs and test on the best performing epoch on the validation set.

4.3 DocRED

DocRED (Yao et al., 2019) is a recent dataset created similarly to our pre-training data, by leveraging Wikipedia and Wikidata. However, it focuses on longer spans of text, with relations between entities at a document level. There is a distantly supervised portion, while the validation and (hidden) test sets are manually annotated. It includes annotations for 6 different entity types and 96 relation types.

Despite the fact that DocRED was originally designed as a relation classification task, we use the splits from Eberts and Ulges (2021) and tackle it as a relation extraction task. In DocRED there are 6 entity types; consequently, we use the tokens <loc>, <misc>, <per>, <num>, <time> and <org> to indicate them.

We fine-tune on top of REBEL for 21 epochs and test on the last checkpoint, using a beam search of 10. For REBEL_pre-training, we use a version trained on a filtered dataset not including any of the Wikipedia pages present in DocRED validation or test sets.

4.4 NYT

NYT (Riedel et al., 2010) is a dataset consisting of news sentences from the New York Times corpus. The dataset contains distantly annotated relations using FreeBase. We use the processed version of Zeng et al. (2018) called NYT-multi, which contains overlapping entities, with three different entity types, and 24 relation types.

We use <loc>, <per> and <org> to indicate the 3 entity types. As for the 24 relation types, we map these to natural language expressions to match those seen at pre-training.

We fine-tune on top of REBEL for a maximum of 42 epochs and test on the best performing epoch on the validation set.

4.5 ADE

ADE (Gurulingappa et al., 2012) is a dataset on the biomedical domain, for which Adverse-Effects from drugs are annotated as pairs of drug and adverse-effect. The dataset provides 10 folds of train and test splits.

Drug and Adverse-Effect are the two entity types, and are always the subject and object entities for the single relation Adverse-Effect. Thus, we keep the same setup as with REBEL, using the <subj> token to distinguish between entity types, and removing the relation from the output, as it is always the same.

We fine-tune on top of REBEL for 25 epochs and evaluate using the last checkpoint for each fold in the dataset. Hyperparameters are selected by using 10% of the training data in the first fold.

4.6 Re-TACRED

Re-TACRED (Stoica et al., 2021) is a Relation Classification dataset, a revised version of the widely used TACRED (Zhang et al., 2017), fixing some of the issues pointed out by Alt et al. (2020). We want to extract the relation between two given entities, or the no_relation prediction, accounting for 63% of the 91,467 sentences in the dataset. To this end, we follow the approach from Zhou and Chen (2021) and mark the entities in the input text using punctuation marks. We do not include any entity-type information.

The output is treated as in previous tasks, and we do not force the decoding of the given entities, as we find it is sufficient to mark them in the input. We fine-tune on top of REBEL for 8 epochs and evaluate using the last checkpoint.

5 Results

5.1 Relation Extraction

For our pre-training task using the REBEL dataset, the model achieves 74 micro-F1 and 51 macro-F1. The dataset is created by distant supervision and serves as a pre-training step; however, it is worth noting its performance for predicting up to 220 different relation types.

Results on selected baselines are presented in Table 2, as well as additional metrics in Tables 3 and 4. We see an improvement across all datasets with pre-trained REBEL, achieving between 1.2 and 6.7 absolute F1 points improvement over recent state-of-the-art models. Using REBEL without the pre-training, we see that performance decreases, especially for smaller datasets or those with many entity types. Nevertheless, it still achieves competitive results, showing the flexibility of tackling RE as a seq2seq task using Transformer Encoder-Decoder models.

Additionally, REBEL shows a better performance than TANL, which was trained in a seq2seq fashion as well, using T5, with BART achieving
System                                        CONLL04   NYT    DocRED   ADE
Strict Evaluation
SpERT (Eberts and Ulges, 2020)                71.5†     -      -        79.2
Table-sequence (Wang and Lu, 2020)            73.6      -      -        80.1‡
JEREX (Eberts and Ulges, 2021)                -         -      40.4     -
TANL (Paolini et al., 2021)                   71.4†     90.8   -        80.6
TANL (multi-dataset) (Paolini et al., 2021)   72.6†     90.5   -        80.0
REBEL                                         71.2      91.8   41.8     81.7
REBEL_pre-training                            75.4      92.0   47.1     82.2
Boundaries Evaluation
TPLinker (Wang et al., 2020)                  -         91.9   -        -
REBEL_pre-training                            -         93.4   -        -

Table 2: Comparison (Micro-F1) with most recent systems. † = explicit use of train+dev; ‡ = filtered overlapping entities (2.8%).
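The micro-F1 values in Table 2 follow the strict matching criterion stated in Section 2.1. The paper relies on the evaluation code from Taillé et al. (2020), which is not reproduced here; as a rough illustration only, micro-averaged precision, recall and F1 over exactly matching triplets could be computed as in the Python sketch below. The function name `micro_prf`, the tuple layout and the toy data are assumptions.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 with exact (strict) matching.

    `gold` and `predicted` are parallel lists holding one set of triplets
    per example; a prediction counts only if it equals a gold triplet.
    """
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data; each triplet is (head, head type, relation, tail, tail type).
gold = [
    {("John", "peop", "live in", "Rome", "loc"),
     ("John", "peop", "work for", "Acme", "org")},
    {("Acme", "org", "organization based in", "Rome", "loc")},
]
predicted = [
    {("John", "peop", "live in", "Rome", "loc"),
     ("John", "peop", "work for", "Acme Corp", "org")},  # wrong tail span
    {("Acme", "org", "organization based in", "Rome", "loc")},
]
precision, recall, f1 = micro_prf(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Counts are pooled over all examples before dividing, which is what distinguishes micro averaging from averaging per-relation scores. Under strict matching, the prediction with the tail "Acme Corp" receives no credit even though it overlaps the gold span.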

Dataset      Precision       Recall          F1
CONLL04      75.59 ± 1.53    75.12 ± 0.64    75.35 ± 1.01
NYT          91.71 ± 0.10    92.21 ± 0.14    91.96 ± 0.07
DocRED       45.89 ± 0.44    48.37 ± 0.44    47.10 ± 0.19
ADE          81.45 ± 1.51    83.07 ± 1.25    82.21 ± 1.08
Re-TACRED    89.48 ± 0.32    91.25 ± 0.22    90.36 ± 0.23

Table 3: Average micro metrics over 5 seeds (10-folds for ADE) for REBEL_pre-training. Standard deviation is indicated after the ± symbol.

Dataset      Precision       Recall          F1
CONLL04      75.22 ± 1.30    69.01 ± 1.68    71.97 ± 1.00
NYT          91.50 ± 0.12    92.02 ± 0.11    91.76 ± 0.04
DocRED       38.75 ± 0.54    45.48 ± 0.36    41.84 ± 0.40
ADE          80.80 ± 2.13    82.62 ± 1.45    81.69 ± 1.70
Re-TACRED    89.41 ± 0.50    91.39 ± 0.12    90.39 ± 0.26

Table 4: Average micro metrics over 5 seeds for REBEL on test sets. Standard deviation is indicated after the ± symbol.

lower results for their approach. Therefore, our triplet linearization approach shows an improvement over other decoding strategies.

Results on RE for DocRED show that, despite being pre-trained on a sentence-based RE, REBEL can perform competitively on document-level RE, without the need for complex pipelines.

Moreover, by having a pre-trained version available, REBEL enables quick fine-tuning on newer domains, such as ADE, with different or fewer relation types, or including entity types. While we train for more epochs in order to achieve the best performance, REBEL still needs fewer training steps to achieve competitive results compared to the other systems. For instance, Paolini et al. (2021) train CONLL04 for up to 200 epochs, Wang and Lu (2020) for up to 5,000, while our model needs fewer than 30 to achieve state-of-the-art results. Each of these systems uses large language models that can be expensive to train, and shorter training time can significantly decrease the costs.

5.2 Budget Training

We explore the training efficiency of REBEL_pre-training, and show the performance when fine-tuned on a low number of epochs. We experiment with CONLL04 and NYT compared to the non-pre-trained model, SpERT and TANL. SpERT was trained for just 20 epochs on CONLL04, while TANL in its non-multi-dataset version is trained for 200 epochs. We adjust each learning rate scheduler to the number of epochs and re-train each model for different epochs and seeds.

Figures 2 and 3 show how in just 8 epochs for CONLL04 and 3 for NYT, REBEL_pre-training can achieve a similar performance as the previous state of the art. While the experiments are on the dev set,
we do not observe big differences in performance between test and dev for these two datasets (see Appendix A.1 Tables 6 and 8). These results also highlight the importance of pre-training REBEL, as it achieves close to the final performance within a few epochs. Also note that while other models achieve lower performances, they also reach close to their final ones. Training for longer times and using early stopping on the validation performance are approaches used by most state-of-the-art models, but this can lead to long and expensive training times. Our experiments show that training for fewer epochs may lead to a small decrease in performance, but it brings the benefit of a more affordable training time. The comparison with other models should also take into account that our pre-trained approach has been previously trained on a massive dataset for 6 epochs, which combined with the fine-tuning in this experiment would lead to longer training times. However, all the other models also rely on pre-trained LM and, similarly, REBEL just needs to be pre-trained once and then quickly fine-tuned on these new datasets.

[Plot: F1 against epochs (6 to 12) for REBEL + pre-training, REBEL, TANL and SpERT.]
Figure 2: Micro-F1 performances on CONLL04 dev set averaged over 5 seeds.

[Plot: F1 against epochs (1 to 3) for REBEL + pre-training, REBEL, TANL and SpERT.]
Figure 3: Micro-F1 performances on NYT dev set averaged over 3 seeds.

5.3 Relation Classification

System                                                 F1
LUKE (Yamada et al., 2020)                             90.3
RoBERTa-LARGE + entity marker (Zhou and Chen, 2021)    90.5
REBEL                                                  90.4
REBEL_pre-training                                     90.4

Table 5: Results on Re-TACRED

As Table 5 shows, REBEL performs fairly well on RC despite being designed for RE. While Zhou and Chen (2021) presented a model with better results (91.1 F1) using entity types, we compare our models with those that do not use them. Both versions of REBEL achieve the same performance, in this case, in contrast to what we saw with RE. This may be due to the pre-training task being solely RE, as well as the size of the dataset.

For REBEL, we evaluate using free generation in the RC setup. Paolini et al. (2021) use likelihood-based prediction which leads to an increase in performance by computing the likelihood of each relation type to be decoded with the two given entities. However, this also leads to an overhead of computation for datasets with a high number of relations such as Re-TACRED. For this reason, we use free generation and are unable to compute results for Re-TACRED using TANL.

6 Conclusion

We have presented REBEL, alongside a new distantly supervised dataset for pre-training. REBEL frames RE into a seq2seq task and, by leveraging BART, achieves state-of-the-art performances in an array of RE benchmarks. We have also shown its flexibility in adapting to new domains, by training on just a few epochs to attain results that are comparable to the previous state of the art, as well as the possibility of using it to perform Relation Classification.

We make REBEL_pre-training available as a standalone RE for more than 200 relation types together with a pre-trained RE model to serve as a baseline when fine-tuning on new RE datasets. Nonetheless, REBEL is based on BART-large, which has a big parameter footprint. Therefore, we also plan to release a pre-trained REBEL-base using BART-base. This will enable quick and efficient RE.

Moreover, our dataset creation pipeline enables a quick and effortless way of obtaining large high-quality RE datasets in multiple languages from a Wikipedia dump. Since both Wikipedia and Wikidata are in constant change, our method provides a way to keep up with those changes and to have up-to-date RE datasets.

We leave to future work the possibility of using a multi-dataset approach as in Paolini et al. (2021), including both RE and RC datasets, and seeing if it retains or improves performance. Furthermore, using our silver dataset as pre-training could lead to improved performance for other systems, especially those which have shown better performance than REBEL without pre-training, such as Wang

Rexhina Blloshmi, Simone Conia, Rocco Tripodi, and Roberto Navigli. 2021. Generating senses and roles: An end-to-end model for dependency- and span-based semantic role labeling. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 3786–3793. International Joint Conferences on Artificial Intelligence Organization. Main Track.

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Autoregressive entity retrieval. In International Conference on Learning Representations.

Rich Caruana. 1998. Multitask Learning, pages 95–133. Springer US, Boston, MA.
and Lu (2020) for CONLL04.
Joachim Daiber, Max Jakob, Chris Hokamp, and
Acknowledgments Pablo N. Mendes. 2013. Improving efficiency and
accuracy in multilingual entity extraction. In Pro-
We would like to thank the authors of Elsahar ceedings of the 9th International Conference on Se-
et al. (2018) for the T-REx open code from which mantic Systems, I-SEMANTICS ’13, page 121–124,
cRocoDiLe was built. New York, NY, USA. Association for Computing
This research was funded by the European Machinery.
Union’s H2020 Marie Skłodowska-Curie project
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Knowledge Graphs at Scale (KnowGraphs) under Kristina Toutanova. 2019. BERT: Pre-training of
H2020-EU.1.3.1. (grant agreement ID: 860801). deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference
of the North American Chapter of the Association
References for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
Heike Adel and Hinrich Schütze. 2017. Global nor- pages 4171–4186, Minneapolis, Minnesota. Associ-
malization of convolutional neural networks for joint ation for Computational Linguistics.
entity and relation classification. In Proceedings
of the 2017 Conference on Empirical Methods in Markus Eberts and Adrian Ulges. 2020. Span-based
Natural Language Processing, pages 1723–1729, joint entity and relation extraction with transformer
Copenhagen, Denmark. Association for Computa- pre-training. In ECAI, pages 2006–2013.
tional Linguistics.
Christoph Alt, Aleksandra Gabryszak, and Leonhard Markus Eberts and Adrian Ulges. 2021. An end-to-end
Hennig. 2020. TACRED revisited: A thorough eval- model for entity-level relation extraction using multi-
uation of the TACRED relation extraction task. In instance learning. In Proceedings of the 16th Con-
Proceedings of the 58th Annual Meeting of the Asso- ference of the European Chapter of the Association
ciation for Computational Linguistics, pages 1558– for Computational Linguistics: Main Volume, pages
1569, Online. Association for Computational Lin- 3650–3660, Online. Association for Computational
guistics. Linguistics.

Giusepppe Attardi. 2015. Wikiextractor. https:// Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci,
github.com/attardi/wikiextractor. Christophe Gravier, Jonathon Hare, Frederique
Michele Bevilacqua, Rexhina Blloshmi, and Roberto Laforest, and Elena Simperl. 2018. T-REx: A large
Navigli. 2021. One spring to rule them both: Sym- scale alignment of natural language with knowledge
metric amr semantic parsing and generation with- base triples. In Proceedings of the Eleventh Interna-
out a complex pipeline. Proceedings of the AAAI tional Conference on Language Resources and Eval-
Conference on Artificial Intelligence, 35(14):12564– uation (LREC 2018), Miyazaki, Japan. European
12573. Language Resources Association (ELRA).

Michele Bevilacqua, Marco Maru, and Roberto Nav- Pankaj Gupta, Hinrich Schütze, and Bernt Andrassy.
igli. 2020. Generationary or “how we went beyond 2016. Table filling multi-task recurrent neural net-
word sense inventories and learned to gloss”. In work for joint entity and relation extraction. In Pro-
Proceedings of the 2020 Conference on Empirical ceedings of COLING 2016, the 26th International
Methods in Natural Language Processing (EMNLP), Conference on Computational Linguistics: Techni-
pages 7207–7221, Online. Association for Computa- cal Papers, pages 2537–2547, Osaka, Japan. The
tional Linguistics. COLING 2016 Organizing Committee.
Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics, 45(5):885–892. Text Mining and Natural Language Processing in Pharmacogenomics.

Arzoo Katiyar and Claire Cardie. 2017. Going out on a limb: Joint extraction of entity mentions and relations without dependency trees. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 917–928, Vancouver, Canada. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116, Berlin, Germany. Association for Computational Linguistics.

Makoto Miwa and Yutaka Sasaki. 2014. Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1858–1869, Doha, Qatar. Association for Computational Linguistics.

Tapas Nayak and Hwee Tou Ng. 2020. Effective modeling of encoder-decoder architecture for joint entity and relation extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8528–8535.

Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero Nogueira dos Santos, Bing Xiang, and Stefano Soatto. 2021. Structured prediction as translation between augmented natural languages. In 9th International Conference on Learning Representations, ICLR 2021.

Sachin Pawar, Pushpak Bhattacharyya, and Girish Palshikar. 2017. End-to-end relation extraction using neural networks and Markov Logic Networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 818–827, Valencia, Spain. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163, Berlin, Heidelberg. Springer Berlin Heidelberg.

Dan Roth and Wen-tau Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004, pages 1–8, Boston, Massachusetts, USA. Association for Computational Linguistics.

George Stoica, Emmanouil Antonios Platanios, and Barnabas Poczos. 2021. Re-TACRED: Addressing shortcomings of the TACRED dataset. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15):13843–13850.

Bruno Taillé, Vincent Guigue, Geoffrey Scoutheeten, and Patrick Gallinari. 2020. Let’s Stop Incorrect Comparisons in End-to-end Relation Extraction! In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3689–3701, Online. Association for Computational Linguistics.

Jue Wang and Wei Lu. 2020. Two are better than one: Joint entity and relation extraction with table-sequence encoders. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1706–1721, Online. Association for Computational Linguistics.

Yucheng Wang, Bowen Yu, Yueyang Zhang, Tingwen Liu, Hongsong Zhu, and Limin Sun. 2020. TPLinker: Single-stage joint extraction of entities and relations through token pair linking. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1572–1582, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 764–777, Florence, Italy. Association for Computational Linguistics.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Daojian Zeng, Haoran Zhang, and Qianying Liu. 2020. CopyMTL: Copy mechanism for joint extraction of entities and relations with multi-task learning. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 9507–9514. AAAI Press.

Xiangrong Zeng, Shizhu He, Daojian Zeng, Kang Liu, Shengping Liu, and Jun Zhao. 2019. Learning the extraction order of multiple relational facts in a sentence with reinforcement learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 367–377, Hong Kong, China. Association for Computational Linguistics.

Xiangrong Zeng, Daojian Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 506–514, Melbourne, Australia. Association for Computational Linguistics.

Ranran Haoran Zhang, Qianying Liu, Aysa Xuemo Fan, Heng Ji, Daojian Zeng, Fei Cheng, Daisuke Kawahara, and Sadao Kurohashi. 2020. Minimize exposure bias of Seq2Seq models in joint entity and relation extraction. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 236–246, Online. Association for Computational Linguistics.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.

Suncong Zheng, Yuexing Hao, Dongyuan Lu, Hongyun Bao, Jiaming Xu, Hongwei Hao, and Bo Xu. 2017. Joint entity and relation extraction based on a hybrid neural network. Neurocomputing, 257:59–66. Machine Learning and Signal Processing for Big Multimedia Analysis.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212, Berlin, Germany. Association for Computational Linguistics.

Wenxuan Zhou and Muhao Chen. 2021. An improved baseline for sentence-level relation extraction. CoRR, abs/2102.01373.

A Appendix

A.1 Results

Performances on the different dev sets can be found in Tables 6 and 8.

             Precision      Recall         F1
CONLL04      77.53 ±1.96    74.2 ±1.26     76.13 ±1.02
NYT          91.64 ±0.26    92.31 ±0.12    91.97 ±0.13
DocRED       46.65 ±0.94    49.19 ±0.43    47.89 ±0.68
Re-TACRED    89.59 ±0.21    90.81 ±0.25    90.19 ±0.13

Table 6: Average micro metrics over 5 seeds for REBEL_pre-training on dev sets. Standard deviation is indicated after the ± symbol.

A.2 Reproducibility

Experiments were performed using a single NVIDIA 3090 GPU with 64GB of RAM and Intel® Core™ i9-10900KF CPU.

The hyperparameters were manually tuned on the validation sets for each dataset, but mostly left at default values for BART. The ones used for the final results can be found in Table 7. The number of parameters for REBEL is the same as for BART-large, 406M parameters, with a negligible increase from the newly added tokens.

             Max epochs   Learning Rate   Warm-up      Weight Decay   Batch size   Time per epoch
CONLL04      33           10^-5           10%          0.01           32           30 sec
NYT          42           2.5 · 10^-5     10%          0.1            24           8 min
DocRED       20           10^-5           10%          0.01           32           2 min
ADE          25           10^-5           10%          0.01           32           1 min
Re-TACRED    6            10^-5           10%          0.01           32           8.5 min
REBEL        3            10^-5           1000 steps   0              32           9 hours

Table 7: Hyperparameters for the different datasets.

             Precision      Recall         F1
CONLL04      74.69 ±0.76    71.66 ±1.01    73.14 ±0.73
NYT          91.44 ±0.12    92.02 ±0.15    91.72 ±0.10
DocRED       46.27 ±1.17    35.92 ±1.81    40.40 ±0.86
Re-TACRED    89.31 ±0.20    90.87 ±0.41    90.08 ±0.19

Table 8: Average micro metrics over 5 seeds for REBEL on dev sets. Standard deviation is indicated after the ± symbol.
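
Tables 6 and 8 report micro-averaged precision, recall and F1 as a mean over 5 seeds, with the standard deviation after the ± symbol. The following is a minimal sketch of how figures of that form can be aggregated, not the evaluation code used for the paper. It assumes that a predicted triplet counts as correct only on an exact match with a gold triplet, and it uses the population standard deviation; the paper does not state which estimator was used. The function names `micro_prf` and `mean_and_std` are hypothetical.

```python
from statistics import mean, pstdev


def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F1 over a whole dev set.

    gold and pred are parallel lists holding one collection of
    (head, relation, tail) triplets per example. Exact matching of
    triplets is an assumption of this sketch.
    """
    tp = fp = fn = 0
    for gold_triplets, pred_triplets in zip(gold, pred):
        g, p = set(gold_triplets), set(pred_triplets)
        tp += len(g & p)  # predicted and in the gold set
        fp += len(p - g)  # predicted but not in the gold set
        fn += len(g - p)  # in the gold set but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_and_std(values):
    """Mean and population standard deviation of per-seed scores.

    Swap pstdev for statistics.stdev to get the sample estimate instead.
    """
    return mean(values), pstdev(values)


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    gold = [[("Rome", "capital of", "Italy"), ("Italy", "capital", "Rome")],
            [("BART", "instance of", "language model")]]
    pred = [[("Rome", "capital of", "Italy")],
            [("BART", "instance of", "language model"),
             ("BART", "developer", "Rome")]]
    p, r, f1 = micro_prf(gold, pred)
    print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")

    per_seed_f1 = [0.50, 0.70]  # hypothetical per-seed F1 scores
    avg, std = mean_and_std(per_seed_f1)
    print(f"F1 = {100 * avg:.2f} ±{100 * std:.2f}")
```

Counts are summed over all examples before the ratios are taken, which is what distinguishes micro averaging from averaging per-example scores.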
