
InstructIE: A Bilingual Instruction-based Information Extraction Dataset

  • Conference paper

The Semantic Web – ISWC 2024 (ISWC 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15233)


Abstract

Large language models perform well on general natural language tasks, but their effectiveness remains suboptimal for information extraction (IE). Recent work attributes this mainly to the scarcity of IE instruction data: existing instruction-based IE datasets not only offer limited coverage but are also costly to construct. To address this, we introduce InstructIE, a bilingual instruction-based IE dataset covering 12 diverse domains, and propose KG2Instruction, a framework specifically for the automatic generation of such datasets. Additionally, we manually annotate the test set. Experimental results demonstrate that large language models trained with InstructIE not only acquire stronger IE capabilities but also achieve better zero-shot performance than baselines.

Resource Type: New Dataset

Source Repo: https://huggingface.co/datasets/zjunlp/InstructIE
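To give a concrete sense of the resource, the sketch below constructs a single instruction-based IE record in the general style such datasets use (an instruction, a relation schema, the input text, and the extracted triples). The field names ("instruction", "schema", "input", "output") and the sample sentence are illustrative assumptions, not the released format; consult the dataset card at the repo above for the actual fields.

```python
import json

# Hypothetical instruction-based IE record. Field names are assumptions
# for illustration; the released InstructIE format may differ.
record = {
    "instruction": "Extract all (head, relation, tail) triples from the "
                   "text that match the given relation schema.",
    "schema": ["country of citizenship", "occupation"],
    "input": "Marie Curie was a Polish and naturalised-French physicist.",
    "output": [
        {"head": "Marie Curie", "relation": "country of citizenship",
         "tail": "Poland"},
        {"head": "Marie Curie", "relation": "country of citizenship",
         "tail": "France"},
        {"head": "Marie Curie", "relation": "occupation",
         "tail": "physicist"},
    ],
}

# Such datasets are commonly stored as one JSON object per line (JSONL);
# round-trip the record to confirm it serialises cleanly.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
print(len(parsed["output"]))  # prints 3 (number of extracted triples)
```

A training corpus in this style is then just one such object per line, with the instruction and schema varying across domains.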

DOI: https://doi.org/10.5281/zenodo.10970777

License: Attribution-NonCommercial-ShareAlike 4.0 International



Notes

  1. https://zenodo.org/records/10970777
  2. https://www.wikidata.org/
  3. https://www.wikipedia.org/
  4. https://huggingface.co/hfl/chinese-roberta-wwm-ext-large
  5. https://huggingface.co/FacebookAI/roberta-large
  6. https://huggingface.co/zjunlp/baichuan2-13b-iepile-lora
  7. https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7
  8. https://github.com/tloen/alpaca-lora
  9. https://github.com/Sewens/COAE2016
  10. https://huggingface.co/zjunlp/OneKE




Acknowledgements

We would like to express our sincere gratitude to the anonymous reviewers for their thoughtful and constructive feedback. This work was supported by the National Natural Science Foundation of China (No. 62206246, No. NSFCU23B2055, No. NSFCU19B2027), the Fundamental Research Funds for the Central Universities (226-2023-00138), Zhejiang Provincial Natural Science Foundation of China (No. LGG22F030011), Yongjiang Talent Introduction Programme (2021A-156-G), and Information Technology Center and State Key Lab of CAD&CG, Zhejiang University. This work was supported by Ant Group and Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph.

Author information


Correspondence to Ningyu Zhang.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Gui, H. et al. (2025). InstructIE: A Bilingual Instruction-based Information Extraction Dataset. In: Demartini, G., et al. The Semantic Web – ISWC 2024. ISWC 2024. Lecture Notes in Computer Science, vol 15233. Springer, Cham. https://doi.org/10.1007/978-3-031-77847-6_4


  • DOI: https://doi.org/10.1007/978-3-031-77847-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-77846-9

  • Online ISBN: 978-3-031-77847-6

  • eBook Packages: Computer Science (R0)

