EMNLP 2021 REBEL Camera Ready
essarily mean that the relation is entailed within the text. Although in Elsahar et al. (2018) high reliability is claimed using this method, it has been shown to be noisy for frequent relations such as country or spouse, and we have found several related annotation issues. We utilize a pre-trained RoBERTa (Liu et al., 2019) Natural Language Inference (NLI) model (footnote 4) to tackle this issue, and use its entailment prediction to filter those relations not entailed by the Wikipedia text. For each triplet, we input the text containing both entities from the Wikipedia abstract, and the triplet in its surface form, subject + relation + object, separated by the <sep> token.

For the previous example and the triplet (Talking Heads, genre, new wave), we input: “This Must Be the Place” is a song by new wave band Talking Heads, released in November 1983 as the second single from its fifth album “Speaking in Tongues”. <sep> Talking Heads genre new wave. We keep those triplets for which the entailment prediction is higher than 0.75. This proves successful in creating cleaner data in preliminary experiments and removing noisy annotations. We create three random splits, with validation and test each being 5% of the total data.

While this data extraction pipeline may still keep some noise, or exclude some relations that are entailed by the text, it enables an automatic way of gathering millions of entities and relations as a silver dataset, sufficient for training our model. We name our RE dataset creation tool cRocoDiLe: Automatic Relation Extraction Dataset with NLI filtering, and we make it available here (footnote 5).

Footnote 4: xlm-roberta-large-xnli
Footnote 5: https://github.com/Babelscape/crocodile

4 Experimental Setup
In this section, we describe the setup to train and evaluate REBEL for four different widely used RE datasets and one RC dataset. Statistics for all the datasets, including our pre-training dataset, can be found in Table 1.

While the training objective is on the autoregressive task, we evaluate the model on RE, extracting all the triplets from the generated output, and evaluating using Recall, Precision, and micro-F1 based on the labeled triplets. For a triplet to be considered correct, the entities and the relation, as well as their types, have to be the same as the labeled ones (this is known as “strict” evaluation in RE) using the evaluation code from Taillé et al. (2020).

4.1 REBEL dataset
We create this dataset by matching Wikipedia hyperlinks with Wikidata entities as explained in Section 3.2. To pre-train our model, we use a sentence-level version of it, where only relations between entities present in each sentence are kept. We keep the 220 most frequent relations in the train split.

We fine-tune REBEL (using BART-large as the base model) on the silver dataset for 6 epochs. We refer to the resulting model as REBELpre−training. While REBELpre−training is in and of itself capable of extracting relations subsuming about 220 types, we show that it also functions as a base step for downstream RE and RC tasks, which are fine-tuned on top of it.

4.2 CONLL04
CONLL04 (Roth and Yih, 2004) is composed of sentences from news articles, annotated with four entity types (person, organization, location and other) and five relation types (kill, work for, organization based in, live in and located in). To compare with previous work, we use the test split from Gupta et al. (2016), and the same validation set as Eberts and Ulges (2020), although we do not include the validation set at final training time.

For CONLL04 we expand REBEL to include entity types. As described in Section 3.1, we introduce a set of new tokens for each entity type. For CONLL04 these are <peop>, <org>, <loc>, <other>. We fine-tune on top of REBEL for 30 epochs and test on the best performing epoch on the validation set.

4.3 DocRED
DocRED (Yao et al., 2019) is a recent dataset created similarly to our pre-training data, by leveraging Wikipedia and Wikidata. However, it focuses on longer spans of text, with relations between entities at a document level. There is a distantly supervised portion, while the validation and (hidden) test sets are manually annotated. It includes annotations for 6 different entity types and 96 relation types.

Despite the fact that DocRED was originally designed as a relation classification task, we use the splits from Eberts and Ulges (2021) and tackle it as a relation extraction task. In DocRED there are 6 entity types; consequently, we use the tokens <loc>, <misc>, <per>, <num>, <time> and <org> to indicate them.

We fine-tune on top of REBEL for 21 epochs and test on the last checkpoint, using a beam search of 10. For REBELpre−training, we use a version trained on a filtered dataset not including any of the Wikipedia pages present in DocRED validation or test sets.

4.4 NYT
NYT (Riedel et al., 2010) is a dataset consisting of news sentences from the New York Times corpus. The dataset contains distantly annotated relations using FreeBase. We use the processed version of Zeng et al. (2018) called NYT-multi, which contains overlapping entities, with three different entity types, and 24 relation types.

We use <loc>, <per> and <org> to indicate the 3 entity types. As for the 24 relation types, we map these to natural language expressions to match those seen at pre-training.

We fine-tune on top of REBEL for a maximum of 42 epochs and test on the best performing epoch on the validation set.

4.5 ADE
ADE (Gurulingappa et al., 2012) is a dataset in the biomedical domain, for which Adverse-Effects from drugs are annotated as pairs of drug and adverse-effect. The dataset provides 10 folds of train and test splits.

Drug and Adverse-Effect are the two entity types, and are always the subject and object entities for the single relation Adverse-Effect. Thus, we keep the same setup as with REBEL, using the <subj> token to distinguish between entity types, and removing the relation from the output, as it is always the same.

We fine-tune on top of REBEL for 25 epochs and evaluate using the last checkpoint for each fold in the dataset. Hyperparameters are selected by using 10% of the training data in the first fold.

4.6 Re-TACRED
Re-TACRED (Stoica et al., 2021) is a Relation Classification dataset, a revised version of the widely used TACRED (Zhang et al., 2017), fixing some of the issues pointed out by Alt et al. (2020). We want to extract the relation between two given entities, or the no_relation prediction, which accounts for 63% of the 91,467 sentences in the dataset. To this end, we follow the approach from Zhou and Chen (2021) and mark the entities in the input text using punctuation marks. We do not include any entity-type information.

The output is treated as in previous tasks, and we do not force the decoding of the given entities, as we find it is sufficient to mark them in the input. We fine-tune on top of REBEL for 8 epochs and evaluate using the last checkpoint.

5 Results

5.1 Relation Extraction
For our pre-training task using the REBEL dataset, the model achieves 74 micro-F1 and 51 macro-F1. The dataset is created by distant supervision and serves as a pre-training step; however, it is worth noting its performance for predicting up to 220 different relation types.

Results on selected baselines are presented in Table 2, as well as additional metrics in Tables 3 and 4. We see an improvement across all datasets with pre-trained REBEL, achieving between 1.2 and 6.7 absolute F1 points improvement over recent state-of-the-art models. Using REBEL without the pre-training, we see that performance decreases, especially for smaller datasets or those with many entity types. Nevertheless, it still achieves competitive results, showing the flexibility of tackling RE as a seq2seq task using Transformer Encoder-Decoder models.

Additionally, REBEL shows a better performance than TANL, which was trained in a seq2seq fashion as well, using T5, with BART achieving
                                              CONLL04   NYT     DocRED   ADE
Strict Evaluation
SpERT (Eberts and Ulges, 2020)                71.5†     -       -        79.2
Table-sequence (Wang and Lu, 2020)            73.6      -       -        80.1‡
JEREX (Eberts and Ulges, 2021)                -         -       40.4     -
TANL (Paolini et al., 2021)                   71.4†     90.8    -        80.6
TANL (multi-dataset) (Paolini et al., 2021)   72.6†     90.5    -        80.0
REBEL                                         71.2      91.8    41.8     81.7
REBELpre−training                             75.4      92.0    47.1     82.2
Boundaries Evaluation
TPLinker (Wang et al., 2020)                  -         91.9    -        -
REBELpre−training                             -         93.4    -        -

Table 2: Comparison (Micro-F1) with most recent systems. † = explicit use of train+dev; ‡ = filtered overlapping entities (2.8%).
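To make the two evaluation settings in Table 2 concrete, the following is a minimal Python sketch of micro-averaged precision, recall and F1 over triplets. It is not the evaluation code from Taillé et al. (2020) used for the reported numbers. The function name `micro_prf`, the tuple layout and the example triplets are assumptions for illustration only. The strict mode follows the definition given in Section 4 (entities, relation and entity types must all match). The relaxed mode assumes that "boundaries" evaluation ignores entity types, which the text here does not spell out.

```python
from typing import List, Set, Tuple

# Hypothetical triplet layout:
# (head text, head type, relation, tail text, tail type)
Triplet = Tuple[str, str, str, str, str]


def _drop_types(t: Triplet) -> Tuple[str, str, str]:
    """Keep only head text, relation and tail text."""
    return (t[0], t[2], t[3])


def micro_prf(gold: List[Set[Triplet]],
              pred: List[Set[Triplet]],
              strict: bool = True) -> Tuple[float, float, float]:
    """Micro precision, recall and F1 over all sentences.

    strict=True  : entity texts, entity types and relation must match.
    strict=False : entity types are ignored (assumed reading of the
                   "boundaries" setting).
    """
    tp = n_pred = n_gold = 0
    for g, p in zip(gold, pred):
        if not strict:
            g = {_drop_types(t) for t in g}
            p = {_drop_types(t) for t in p}
        tp += len(g & p)       # triplets predicted and labeled
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up example using the CONLL04 entity and relation names.
    gold = [{("John", "peop", "live in", "Boston", "loc"),
             ("Acme", "org", "organization based in", "Boston", "loc")}]
    pred = [{("John", "peop", "live in", "Boston", "org"),  # wrong tail type
             ("Acme", "org", "organization based in", "Boston", "loc")}]
    print(micro_prf(gold, pred, strict=True))   # (0.5, 0.5, 0.5)
    print(micro_prf(gold, pred, strict=False))  # (1.0, 1.0, 1.0)
```

In this example, the triplet with a mistyped tail entity counts as an error under strict evaluation but as correct once types are ignored, which is why scores under the two settings are not directly comparable.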
Table 3: Average micro metrics over 5 seeds (10-folds for ADE) for REBELpre−training. Standard deviation is indicated after the ± symbol.

Table 4: Average micro metrics over 5 seeds for REBEL on test sets. Standard deviation is indicated after the ± symbol.
lower results for their approach. Therefore, our triplet linearization approach shows an improvement over other decoding strategies.

Results on RE for DocRED show that, despite being pre-trained on a sentence-based RE, REBEL can perform competitively on document-level RE, without the need for complex pipelines.

Moreover, by having a pre-trained version available, REBEL enables quick fine-tuning on newer domains, such as ADE, with different or fewer relation types, or including entity types. While in order to achieve the best performance we train for more epochs, REBEL still needs fewer training steps to achieve competitive results compared to the other systems. For instance, Paolini et al. (2021) train CONLL04 for up to 200 epochs, Wang and Lu (2020) for up to 5,000, while our model needs fewer than 30 to achieve state-of-the-art results. Each of these systems uses large language models that can be expensive to train, and shorter training time can significantly decrease the costs.

5.2 Budget Training
We explore the training efficiency of REBELpre_trained, and show the performance when fine-tuned on a low number of epochs. We experiment with CONLL04 and NYT compared to the non-pre-trained model, SpERT and TANL. SpERT was trained for just 20 epochs on CONLL04, while TANL in its non-multi-dataset version is trained for 200 epochs. We adjust each learning rate scheduler to the number of epochs and re-train each model for different epochs and seeds.

Figures 2 and 3 show how in just 8 epochs for CONLL04 and 3 for NYT, REBELpre_trained can achieve a similar performance as the previous state of the art. While the experiments are on the dev set,
we do not observe big differences in performance between test and dev for these two datasets (see Appendix A.1 Tables 6 and 8). These results also highlight the importance of pre-training REBEL, as it achieves close to the final performance within a few epochs. Also note that while other models achieve lower performances, they also reach close to their final ones. Training for longer times and using early stopping on the validation performance are approaches used by most state-of-the-art models, but this can lead to long and expensive training times. Our experiments show that training for fewer epochs may lead to a small decrease in performance, but it brings the benefit of a more affordable training time. The comparison with other models should also take into account that our pre-trained approach has been previously trained on a massive dataset for 6 epochs, which combined with the fine-tuning in this experiment would lead to longer training times. However, all the other models also rely on pre-trained LM and, similarly, REBEL just needs to be pre-trained once and then quickly fine-tuned on these new datasets.

[Plot: F1 (y-axis) against epochs (x-axis, 6 to 12) for REBEL + pre-training, REBEL, TANL and SpERT]
Figure 2: Micro-F1 performances on CONLL04 dev set averaged over 5 seeds.

[Plot: F1 (y-axis) against epochs (x-axis, 1 to 3) for REBEL + pre-training, REBEL, TANL and SpERT]
Figure 3: Micro-F1 performances on NYT dev set averaged over 3 seeds.

5.3 Relation Classification
As Table 5 shows, REBEL performs fairly well on RC despite being designed for RE. While Zhou and Chen (2021) presented a model with better results (91.1 F1) using entity types, we compare our models with those that do not use them. Both versions of REBEL achieve the same performance in this case, in contrast to what we saw with RE. This may be due to the pre-training task being solely RE, as well as the size of the dataset.

                                                      F1
LUKE (Yamada et al., 2020)                            90.3
RoBERTaLARGE + entity marker (Zhou and Chen, 2021)    90.5
REBEL                                                 90.4
REBELpre−training                                     90.4

Table 5: Results on Re-TACRED

For REBEL, we evaluate using free generation in the RC setup. Paolini et al. (2021) use likelihood-based prediction, which leads to an increase in performance by computing the likelihood of each relation type to be decoded with the two given entities. However, this also leads to an overhead of computation for datasets with a high number of relations such as Re-TACRED. For this reason, we use free generation and are unable to compute results for Re-TACRED using TANL.

6 Conclusion
We have presented REBEL, alongside a new distantly supervised dataset for pre-training. REBEL frames RE as a seq2seq task and, by leveraging BART, achieves state-of-the-art performances in an array of RE benchmarks. We have also shown its flexibility in adapting to new domains, by training for just a few epochs to attain results that are comparable to the previous state of the art, as well as the possibility of using it to perform Relation Classification.

We make REBELpre−training available as a standalone RE model for more than 200 relation types, and as a pre-trained RE model to serve as a baseline when fine-tuning on new RE datasets. Nonetheless, REBEL is based on BART-large, which has a big parameter footprint. Therefore, we also plan to release a pre-trained REBEL-base using BART-base. This will enable quick and efficient RE.

Moreover, our dataset creation pipeline enables a quick and effortless way of obtaining large high-
quality RE datasets in multiple languages from a Wikipedia dump. Since both Wikipedia and Wikidata are in constant change, our method provides a way to keep up with those changes and to have up-to-date RE datasets.

We leave to future work the possibility of using a multi-dataset approach as in Paolini et al. (2021), including both RE and RC datasets, and seeing if it retains or improves performance. Furthermore, using our silver dataset as pre-training could lead to improved performance for other systems, especially those which have shown better performance than REBEL without pre-training, such as Wang and Lu (2020) for CONLL04.

Acknowledgments
We would like to thank the authors of Elsahar et al. (2018) for the T-REx open code from which cRocoDiLe was built.

This research was funded by the European Union’s H2020 Marie Skłodowska-Curie project Knowledge Graphs at Scale (KnowGraphs) under H2020-EU.1.3.1. (grant agreement ID: 860801).

References

Heike Adel and Hinrich Schütze. 2017. Global normalization of convolutional neural networks for joint entity and relation classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1723–1729, Copenhagen, Denmark. Association for Computational Linguistics.

Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. TACRED revisited: A thorough evaluation of the TACRED relation extraction task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1558–1569, Online. Association for Computational Linguistics.

Giuseppe Attardi. 2015. Wikiextractor. https://github.com/attardi/wikiextractor.

Michele Bevilacqua, Rexhina Blloshmi, and Roberto Navigli. 2021. One SPRING to rule them both: Symmetric AMR semantic parsing and generation without a complex pipeline. Proceedings of the AAAI Conference on Artificial Intelligence, 35(14):12564–12573.

Michele Bevilacqua, Marco Maru, and Roberto Navigli. 2020. Generationary or “how we went beyond word sense inventories and learned to gloss”. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7207–7221, Online. Association for Computational Linguistics.

Rexhina Blloshmi, Simone Conia, Rocco Tripodi, and Roberto Navigli. 2021. Generating senses and roles: An end-to-end model for dependency- and span-based semantic role labeling. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 3786–3793. International Joint Conferences on Artificial Intelligence Organization. Main Track.

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Autoregressive entity retrieval. In International Conference on Learning Representations.

Rich Caruana. 1998. Multitask Learning, pages 95–133. Springer US, Boston, MA.

Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. 2013. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS ’13, pages 121–124, New York, NY, USA. Association for Computing Machinery.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Markus Eberts and Adrian Ulges. 2020. Span-based joint entity and relation extraction with transformer pre-training. In ECAI, pages 2006–2013.

Markus Eberts and Adrian Ulges. 2021. An end-to-end model for entity-level relation extraction using multi-instance learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3650–3660, Online. Association for Computational Linguistics.

Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Pankaj Gupta, Hinrich Schütze, and Bernt Andrassy. 2016. Table filling multi-task recurrent neural network for joint entity and relation extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2537–2547, Osaka, Japan. The COLING 2016 Organizing Committee.
Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics, 45(5):885–892. Text Mining and Natural Language Processing in Pharmacogenomics.

Arzoo Katiyar and Claire Cardie. 2017. Going out on a limb: Joint extraction of entity mentions and relations without dependency trees. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 917–928, Vancouver, Canada. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116, Berlin, Germany. Association for Computational Linguistics.

Makoto Miwa and Yutaka Sasaki. 2014. Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1858–1869, Doha, Qatar. Association for Computational Linguistics.

Tapas Nayak and Hwee Tou Ng. 2020. Effective modeling of encoder-decoder architecture for joint entity and relation extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8528–8535.

Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero Nogueira dos Santos, Bing Xiang, and Stefano Soatto. 2021. Structured prediction as translation between augmented natural languages. In 9th International Conference on Learning Representations, ICLR 2021.

Sachin Pawar, Pushpak Bhattacharyya, and Girish Palshikar. 2017. End-to-end relation extraction using neural networks and Markov Logic Networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 818–827, Valencia, Spain. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163, Berlin, Heidelberg. Springer Berlin Heidelberg.

Dan Roth and Wen-tau Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004, pages 1–8, Boston, Massachusetts, USA. Association for Computational Linguistics.

George Stoica, Emmanouil Antonios Platanios, and Barnabas Poczos. 2021. Re-TACRED: Addressing shortcomings of the TACRED dataset. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15):13843–13850.

Bruno Taillé, Vincent Guigue, Geoffrey Scoutheeten, and Patrick Gallinari. 2020. Let’s Stop Incorrect Comparisons in End-to-end Relation Extraction! In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3689–3701, Online. Association for Computational Linguistics.

Jue Wang and Wei Lu. 2020. Two are better than one: Joint entity and relation extraction with table-sequence encoders. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1706–1721, Online. Association for Computational Linguistics.

Yucheng Wang, Bowen Yu, Yueyang Zhang, Tingwen Liu, Hongsong Zhu, and Limin Sun. 2020. TPLinker: Single-stage joint extraction of entities and relations through token pair linking. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1572–1582, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 764–777, Florence, Italy. Association for Computational Linguistics.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Daojian Zeng, Haoran Zhang, and Qianying Liu. 2020. CopyMTL: Copy mechanism for joint extraction of entities and relations with multi-task learning. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 9507–9514. AAAI Press.

Xiangrong Zeng, Shizhu He, Daojian Zeng, Kang Liu, Shengping Liu, and Jun Zhao. 2019. Learning the extraction order of multiple relational facts in a sentence with reinforcement learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 367–377, Hong Kong, China. Association for Computational Linguistics.

Xiangrong Zeng, Daojian Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 506–514, Melbourne, Australia. Association for Computational Linguistics.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.

Suncong Zheng, Yuexing Hao, Dongyuan Lu, Hongyun Bao, Jiaming Xu, Hongwei Hao, and Bo Xu. 2017. Joint entity and relation extraction based on a hybrid neural network. Neurocomputing, 257:59–66. Machine Learning and Signal Processing for Big Multimedia Analysis.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212, Berlin, Germany. Association for Computational Linguistics.

Wenxuan Zhou and Muhao Chen. 2021. An improved baseline for sentence-level relation extraction. CoRR, abs/2102.01373.

A Appendix

A.1 Results
Performances on the different dev sets can be found in Tables 6 and 8.

             Precision      Recall         F1
CONLL04      77.53 ±1.96    74.2 ±1.26     76.13 ±1.02
NYT          91.64 ±0.26    92.31 ±0.12    91.97 ±0.13
DocRED       46.65 ±0.94    49.19 ±0.43    47.89 ±0.68
Re-TACRED    89.59 ±0.21    90.81 ±0.25    90.19 ±0.13

Table 6: Average micro metrics over 5 seeds for REBELpre−training on dev sets. Standard deviation is indicated after the ± symbol.

A.2 Reproducibility
             Precision      Recall         F1
CONLL04      74.69 ±0.76    71.66 ±1.01    73.14 ±0.73
NYT          91.44 ±0.12    92.02 ±0.15    91.72 ±0.10
DocRED       46.27 ±1.17    35.92 ±1.81    40.40 ±0.86
Re-TACRED    89.31 ±0.20    90.87 ±0.41    90.08 ±0.19
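
As a small illustration of how entries of the form "mean ±std" in the tables above can be produced from per-seed scores, here is a minimal Python sketch. The helper name `summarize` and the per-seed values in the usage lines are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not say which is reported.

```python
from statistics import mean, stdev
from typing import Sequence


def summarize(scores: Sequence[float]) -> str:
    """Format per-seed scores as 'mean ±std' with two decimals.

    Uses the sample standard deviation (n - 1 in the denominator);
    this choice is an assumption, not taken from the paper.
    """
    return f"{mean(scores):.2f} ±{stdev(scores):.2f}"


if __name__ == "__main__":
    # Hypothetical F1 scores from three seeds, for illustration only.
    print(summarize([90.0, 91.0, 92.0]))  # 91.00 ±1.00
```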