Abstract
Large language models (LLMs) perform well on general natural language tasks, but they remain suboptimal at information extraction (IE). Recent work attributes this gap mainly to the scarcity of large-scale IE instruction data: existing instruction datasets for IE not only have limited coverage but are also costly to construct. To address this issue, we introduce InstructIE, a bilingual instruction-based IE dataset covering 12 diverse domains, and propose KG2Instruction, a framework for the automatic generation of such datasets. Additionally, we manually annotate the test set. Experimental results demonstrate that large language models trained with InstructIE not only acquire stronger IE capabilities but also achieve better zero-shot performance than baselines.
Resource Type: New Dataset
Source Repo: https://huggingface.co/datasets/zjunlp/InstructIE
DOI: https://doi.org/10.5281/zenodo.10970777
License: Attribution-NonCommercial-ShareAlike 4.0 International
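As a rough illustration of what an instruction-based IE record can look like, the sketch below builds and parses one such record with Python's standard `json` module. The field names (`instruction`, `schema`, `input`, `output`) and the example content are assumptions chosen for illustration, not InstructIE's documented schema; consult the repository above for the actual format.

```python
import json

# Hypothetical instruction-based IE record: an instruction, a target
# schema (relation types), the input text, and the expected triples.
# Field names are illustrative assumptions, not the dataset's schema.
record = json.dumps({
    "instruction": "Extract (head, relation, tail) triples from the input "
                   "according to the given schema.",
    "schema": ["place of birth", "occupation"],
    "input": "Marie Curie was born in Warsaw and worked as a physicist.",
    "output": [
        {"head": "Marie Curie", "relation": "place of birth", "tail": "Warsaw"},
        {"head": "Marie Curie", "relation": "occupation", "tail": "physicist"},
    ],
})

def parse_record(raw: str) -> list[tuple[str, str, str]]:
    """Deserialize one record and return its gold triples as tuples."""
    data = json.loads(raw)
    return [(t["head"], t["relation"], t["tail"]) for t in data["output"]]

triples = parse_record(record)
print(triples)
```

In this shape, the `schema` list constrains which relations the model should extract, which is what lets an instruction-tuned model generalize zero-shot to unseen schemas.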
Acknowledgements
We would like to express our sincere gratitude to the anonymous reviewers for their thoughtful and constructive feedback. This work was supported by the National Natural Science Foundation of China (No. 62206246, No. NSFCU23B2055, No. NSFCU19B2027), the Fundamental Research Funds for the Central Universities (226-2023-00138), the Zhejiang Provincial Natural Science Foundation of China (No. LGG22F030011), the Yongjiang Talent Introduction Programme (2021A-156-G), and the Information Technology Center and State Key Lab of CAD&CG, Zhejiang University. It was also supported by Ant Group and the Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gui, H. et al. (2025). InstructIE: A Bilingual Instruction-based Information Extraction Dataset. In: Demartini, G., et al. The Semantic Web – ISWC 2024. ISWC 2024. Lecture Notes in Computer Science, vol 15233. Springer, Cham. https://doi.org/10.1007/978-3-031-77847-6_4
DOI: https://doi.org/10.1007/978-3-031-77847-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-77846-9
Online ISBN: 978-3-031-77847-6