Abstract
This paper presents an exhaustive quantitative and qualitative evaluation of Large Language Models (LLMs) for Knowledge Graph (KG) construction and reasoning. We conduct experiments on eight diverse datasets, covering four representative tasks: entity and relation extraction, event extraction, link prediction, and question answering, thereby thoroughly exploring LLMs’ performance in both construction and inference. Empirically, our findings suggest that LLMs, represented by GPT-4, are better suited to serving as inference assistants than as few-shot information extractors. Specifically, while GPT-4 exhibits good performance on tasks related to KG construction, it excels further at reasoning tasks, surpassing fine-tuned models in certain cases. Moreover, our investigation extends to the potential generalization ability of LLMs for information extraction, leading to the proposition of a Virtual Knowledge Extraction task and the development of the corresponding VINE dataset. Based on these empirical findings, we further propose AutoKG, a multi-agent-based approach employing LLMs and external sources for KG construction and reasoning. We anticipate that this research can provide invaluable insights for future undertakings in the field of knowledge graphs.





Data and Materials availability
Our data and materials are available at https://github.com/zjunlp/AutoKG.
References
Cai, B., Xiang, Y., Gao, L., Zhang, H., Li, Y., Li, J.: Temporal knowledge graph completion: A survey. In: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, pp. 6545–6553 (2023). https://doi.org/10.24963/IJCAI.2023/734
Zhu, X., Li, Z., Wang, X., Jiang, X., Sun, P., Wang, X., Xiao, Y., Yuan, N.J.: Multi-modal knowledge graph construction and application: A survey. IEEE Trans. Knowl. Data Eng. 36(2), 715–735 (2024). https://doi.org/10.1109/TKDE.2022.3224228
Liang, K., Meng, L., Liu, M., Liu, Y., Tu, W., Wang, S., Zhou, S., Liu, X., Sun, F.: Reasoning over different types of knowledge graphs: Static, temporal and multi-modal. CoRR (2022). https://doi.org/10.48550/arXiv.2212.05767
Chen, X., Zhang, J., Wang, X., Wu, T., Deng, S., Wang, Y., Si, L., Chen, H., Zhang, N.: Continual multimodal knowledge graph construction. CoRR (2023). https://doi.org/10.48550/arXiv.2305.08698
Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., Wu, X.: Unifying large language models and knowledge graphs: A roadmap. IEEE Trans. Knowl. Data Eng. 36(7), 3580–3599 (2024). https://doi.org/10.1109/TKDE.2024.3352100
Pan, J.Z., Razniewski, S., Kalo, J., Singhania, S., Chen, J., Dietze, S., Jabeen, H., Omeliyanenko, J., Zhang, W., Lissandrini, M., Biswas, R., Melo, G., Bonifati, A., Vakaj, E., Dragoni, M., Graux, D.: Large language models and knowledge graphs: Opportunities and challenges. TGDK 1(1), 2:1–2:38 (2023). https://doi.org/10.4230/TGDK.1.1.2
Ye, H., Zhang, N., Chen, H., Chen, H.: Generative knowledge graph construction: A review. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp. 1–17 (2022). https://doi.org/10.18653/v1/2022.emnlp-main.1
Ding, L., Zhou, S., Xiao, J., Han, J.: Automated construction of theme-specific knowledge graphs. CoRR (2024). https://doi.org/10.48550/ARXIV.2404.19146
Chiu, J.P.C., Nichols, E.: Named entity recognition with bidirectional lstm-cnns. Trans. Assoc. Comput. Linguistics 4, 357–370 (2016). https://doi.org/10.1162/tacl_a_00104
Gui, H., Yuan, L., Ye, H., Zhang, N., Sun, M., Liang, L., Chen, H.: Iepile: Unearthing large-scale schema-based information extraction corpus. CoRR (2024). https://doi.org/10.48550/ARXIV.2402.14710
Zeng, D., Liu, K., Chen, Y., Zhao, J.: Distant supervision for relation extraction via piecewise convolutional neural networks. In: Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., Marton, Y. (eds.) Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 1753–1762 (2015). https://doi.org/10.18653/v1/d15-1203
Chen, X., Zhang, N., Xie, X., Deng, S., Yao, Y., Tan, C., Huang, F., Si, L., Chen, H.: Knowprompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. In: Laforest, F., Troncy, R., Simperl, E., Agarwal, D., Gionis, A., Herman, I., Médini, L. (eds.) WWW ’22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, pp. 2778–2788 (2022). https://doi.org/10.1145/3485447.3511998
Chen, Y., Xu, L., Liu, K., Zeng, D., Zhao, J.: Event extraction via dynamic multi-pooling convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp. 167–176 (2015). https://doi.org/10.3115/v1/p15-1017
Deng, S., Zhang, N., Kang, J., Zhang, Y., Zhang, W., Chen, H.: Meta-learning with dynamic-memory-based prototypical network for few-shot event detection. In: Caverlee, J., Hu, X.B., Lalmas, M., Wang, W. (eds.) WSDM ’20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, pp. 151–159 (2020). https://doi.org/10.1145/3336191.3371796
Shen, W., Wang, J., Han, J.: Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Trans. Knowl. Data Eng. 27(2), 443–460 (2015). https://doi.org/10.1109/TKDE.2014.2327028
Zhang, Y., Dai, H., Kozareva, Z., Smola, A.J., Song, L.: Variational reasoning for question answering with knowledge graph. In: McIlraith, S.A., Weinberger, K.Q. (eds.) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018 (2018). https://doi.org/10.1609/aaai.v32i1.12057
Rossi, A., Barbosa, D., Firmani, D., Matinata, A., Merialdo, P.: Knowledge graph embedding for link prediction: A comparative analysis. ACM Trans. Knowl. Discov. Data 15(2), 14:1–14:49 (2021). https://doi.org/10.1145/3424672
Karpukhin, V., Oguz, B., Min, S., Lewis, P.S.H., Wu, L., Edunov, S., Chen, D., Yih, W.: Dense passage retrieval for open-domain question answering. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020 (2020). https://doi.org/10.18653/v1/2020.emnlp-main.550
Zhu, F., Lei, W., Wang, C., Zheng, J., Poria, S., Chua, T.: Retrieving and reading: A comprehensive survey on open-domain question answering. CoRR (2021)
OpenAI: GPT-4 technical report. CoRR (2023). arxiv:2303.08774. https://doi.org/10.48550/arXiv.2303.08774
Liu, A., Hu, X., Wen, L., Yu, P.S.: A comprehensive evaluation of chatgpt’s zero-shot text-to-sql capability. CoRR (2023). https://doi.org/10.48550/arXiv.2303.13547
Shakarian, P., Koyyalamudi, A., Ngu, N., Mareedu, L.: An independent evaluation of chatgpt on mathematical word problems (MWP). In: Martin, A., Fill, H., Gerber, A., Hinkelmann, K., Lenat, D., Stolle, R., Harmelen, F. (eds.) Proceedings of the AAAI 2023 Spring Symposium on Challenges Requiring the Combination of Machine Learning and Knowledge Engineering (AAAI-MAKE 2023), Hyatt Regency, San Francisco Airport, California, USA, March 27-29, 2023. CEUR Workshop Proceedings, vol. 3433 (2023)
Lai, V.D., Ngo, N.T., Veyseh, A.P.B., Man, H., Dernoncourt, F., Bui, T., Nguyen, T.H.: Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. In: Bouamor, H., Pino, J., Bali, K. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023 (2023). https://doi.org/10.18653/v1/2023.findings-emnlp.878
Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J., Wen, J.: A survey of large language models. CoRR (2023). https://doi.org/10.48550/arXiv.2303.18223
Wei, X., Cui, X., Cheng, N., Wang, X., Zhang, X., Huang, S., Xie, P., Xu, J., Chen, Y., Zhang, M., Jiang, Y., Han, W.: Zero-shot information extraction via chatting with chatgpt. CoRR (2023). arxiv:2302.10205. https://doi.org/10.48550/arXiv.2302.10205
Li, B., Fang, G., Yang, Y., Wang, Q., Ye, W., Zhao, W., Zhang, S.: Evaluating chatgpt’s information extraction capabilities: An assessment of performance, explainability, calibration, and faithfulness. CoRR (2023)
Li, G., Wang, P., Ke, W.: Revisiting large language models as zero-shot relation extractors. In: Bouamor, H., Pino, J., Bali, K. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023 (2023). https://doi.org/10.18653/v1/2023.findings-emnlp.459
Wan, Z., Cheng, F., Mao, Z., Liu, Q., Song, H., Li, J., Kurohashi, S.: GPT-RE: in-context learning for relation extraction using large language models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 (2023). https://doi.org/10.18653/v1/2023.emnlp-main.214
Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., Yang, D.: Is chatgpt a general-purpose natural language processing task solver? In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 (2023). https://doi.org/10.18653/v1/2023.emnlp-main.85
Liu, H., Ning, R., Teng, Z., Liu, J., Zhou, Q., Zhang, Y.: Evaluating the logical reasoning ability of chatgpt and GPT-4. CoRR (2023). https://doi.org/10.48550/ARXIV.2304.03439
Jiang, J., Zhou, K., Zhao, W.X., Song, Y., Zhu, C., Zhu, H., Wen, J.: Kg-agent: An efficient autonomous agent framework for complex reasoning over knowledge graph. CoRR (2024). https://doi.org/10.48550/ARXIV.2402.11163
Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H.W., Tay, Y., Zhou, D., Le, Q.V., Zoph, B., Wei, J., Roberts, A.: The flan collection: Designing data and methods for effective instruction tuning. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. Proceedings of Machine Learning Research (2023)
Christiano, P.F., Leike, J., Brown, T.B., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences. In: Guyon, I., Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA (2017)
Leiter, C., Zhang, R., Chen, Y., Belouadi, J., Larionov, D., Fresen, V., Eger, S.: Chatgpt: A meta-analysis after 2.5 months. CoRR (2023). https://doi.org/10.48550/ARXIV.2302.13795
Yang, J., Jin, H., Tang, R., Han, X., Feng, Q., Jiang, H., Zhong, S., Yin, B., Hu, X.B.: Harnessing the power of llms in practice: A survey on chatgpt and beyond. ACM Trans. Knowl. Discov. Data 18(6), 160:1–160:32 (2024). https://doi.org/10.1145/3649506
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 (2022)
Wang, Z., Zhang, G., Yang, K., Shi, N., Zhou, W., Hao, S., Xiong, G., Li, Y., Sim, M.Y., Chen, X., Zhu, Q., Yang, Z., Nik, A., Liu, Q., Lin, C., Wang, S., Liu, R., Chen, W., Xu, K., Liu, D., Guo, Y., Fu, J.: Interactive natural language processing. CoRR (2023). https://doi.org/10.48550/arXiv.2305.13246
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y.T., Li, Y., Lundberg, S.M., Nori, H., Palangi, H., Ribeiro, M.T., Zhang, Y.: Sparks of artificial general intelligence: Early experiments with GPT-4. CoRR (2023). https://doi.org/10.48550/ARXIV.2303.12712
Li, S., He, W., Shi, Y., Jiang, W., Liang, H., Jiang, Y., Zhang, Y., Lyu, Y., Zhu, Y.: Duie: A large-scale chinese dataset for information extraction. In: Tang, J., Kan, M., Zhao, D., Li, S., Zan, H. (eds.) Natural Language Processing and Chinese Computing - 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9-14, 2019, Proceedings, Part II. Lecture Notes in Computer Science, vol. 11839, pp. 791–800 (2019). https://doi.org/10.1007/978-3-030-32236-6_72
Luan, Y., He, L., Ostendorf, M., Hajishirzi, H.: Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 3219–3232 (2018). https://doi.org/10.18653/v1/d18-1360
Stoica, G., Platanios, E.A., Póczos, B.: Re-tacred: Addressing shortcomings of the TACRED dataset. In: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pp. 13843–13850 (2021)
Wang, X., Wang, Z., Han, X., Jiang, W., Han, R., Liu, Z., Li, J., Li, P., Lin, Y., Zhou, J.: MAVEN: A massive general domain event detection dataset. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 1652–1671 (2020). https://doi.org/10.18653/v1/2020.emnlp-main.129
Toutanova, K., Chen, D., Pantel, P., Poon, H., Choudhury, P., Gamon, M.: Representing text for joint embedding of text and knowledge bases. In: Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., Marton, Y. (eds.) Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 1499–1509 (2015). https://doi.org/10.18653/v1/d15-1174
Hwang, J.D., Bhagavatula, C., Bras, R.L., Da, J., Sakaguchi, K., Bosselut, A., Choi, Y.: (comet-) atomic 2020: On symbolic and neural commonsense knowledge graphs. In: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pp. 6384–6392 (2021)
Jiang, K., Wu, D., Jiang, H.: Freebaseqa: A new factoid QA data set matching trivia-style question-answer pairs with freebase. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 318–323 (2019). https://doi.org/10.18653/v1/n19-1028
Ye, D., Lin, Y., Li, P., Sun, M.: Packed levitated marker for entity and relation extraction. In: Muresan, S., Nakov, P., Villavicencio, A. (eds.) Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp. 4904–4917 (2022). https://doi.org/10.18653/v1/2022.acl-long.337
Park, S., Kim, H.: Improving sentence-level relation extraction through curriculum learning. CoRR (2021)
Wang, S., Yu, M., Huang, L.: The art of prompting: Event detection based on type specific prompts. In: Rogers, A., Boyd-Graber, J.L., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 1286–1299 (2023). https://doi.org/10.18653/v1/2023.acl-short.111
Wang, X., He, Q., Liang, J., Xiao, Y.: Language models as knowledge embeddings. In: Raedt, L.D. (ed.) Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pp. 2291–2297 (2022). https://doi.org/10.24963/ijcai.2022/318
Hwang, J.D., Bhagavatula, C., Bras, R.L., Da, J., Sakaguchi, K., Bosselut, A., Choi, Y.: (comet-) atomic 2020: On symbolic and neural commonsense knowledge graphs. In: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pp. 6384–6392 (2021). https://doi.org/10.1609/aaai.v35i7.16792
Yu, D., Zhang, S., Ng, P., Zhu, H., Li, A.H., Wang, J., Hu, Y., Wang, W.Y., Wang, Z., Xiang, B.: Decaf: Joint decoding of answers and logical forms for question answering over knowledge bases. In: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 (2023)
Madani, N., Joseph, K.: Answering questions over knowledge graphs using logic programming along with language models. In: Maughan, K., Liu, R., Burns, T.F. (eds.) The First Tiny Papers Track at ICLR 2023, Tiny Papers @ ICLR 2023, Kigali, Rwanda, May 5, 2023 (2023)
Gao, J., Zhao, H., Yu, C., Xu, R.: Exploring the feasibility of chatgpt for event extraction. CoRR (2023). https://doi.org/10.48550/arXiv.2303.03836
Dong, Q., Li, L., Dai, D., Zheng, C., Wu, Z., Chang, B., Sun, X., Xu, J., Li, L., Sui, Z.: A survey for in-context learning. CoRR (2023). https://doi.org/10.48550/ARXIV.2301.00234
Wei, J.W., Wei, J., Tay, Y., Tran, D., Webson, A., Lu, Y., Chen, X., Liu, H., Huang, D., Zhou, D., Ma, T.: Larger language models do in-context learning differently. CoRR (2023)
Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., Zhao, W.X., Wei, Z., Wen, J.: A survey on large language model based autonomous agents. Frontiers Comput. Sci. 18(6), 186345 (2024). https://doi.org/10.1007/S11704-024-40231-1
Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., Zheng, R., Fan, X., Wang, X., Xiong, L., Zhou, Y., Wang, W., Jiang, C., Zou, Y., Liu, X., Yin, Z., Dou, S., Weng, R., Cheng, W., Zhang, Q., Qin, W., Zheng, Y., Qiu, X., Huan, X., Gui, T.: The rise and potential of large language model based agents: A survey. CoRR (2023). https://doi.org/10.48550/arXiv.2309.07864
Zhao, P., Jin, Z., Cheng, N.: An in-depth survey of large language model-based artificial intelligence agents. CoRR (2023). https://doi.org/10.48550/arXiv.2309.14365
Li, G., Hammoud, H.A.A.K., Itani, H., Khizbullin, D., Ghanem, B.: CAMEL: communicative agents for “mind” exploration of large scale language model society. CoRR (2023). https://doi.org/10.48550/arXiv.2303.17760
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Virtual (2020). https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E.H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., Fedus, W.: Emergent abilities of large language models. Trans. Mach. Learn. Res. 2022 (2022)
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., Do, Q.V., Xu, Y., Fung, P.: A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. In: Park, J.C., Arase, Y., Hu, B., Lu, W., Wijaya, D., Purwarianti, A., Krisnadhi, A.A. (eds.) Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, IJCNLP 2023 -Volume 1: Long Papers, Nusa Dua, Bali, November 1 - 4, 2023, pp. 675–718 (2023). https://doi.org/10.18653/v1/2023.ijcnlp-main.45
Nori, H., King, N., McKinney, S.M., Carignan, D., Horvitz, E.: Capabilities of GPT-4 on medical challenge problems. CoRR (2023). https://doi.org/10.48550/ARXIV.2303.13375
Qiao, S., Ou, Y., Zhang, N., Chen, X., Yao, Y., Deng, S., Tan, C., Huang, F., Chen, H.: Reasoning with language model prompting: A survey. In: Rogers, A., Boyd-Graber, J.L., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 5368–5393 (2023). https://doi.org/10.18653/v1/2023.acl-long.294
Sánchez, R.J., Conrads, L., Welke, P., Cvejoski, K., Marin, C.O.: Hidden schema networks. In: Rogers, A., Boyd-Graber, J.L., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 4764–4798 (2023). https://doi.org/10.18653/v1/2023.acl-long.263
Ma, Y., Cao, Y., Hong, Y., Sun, A.: Large language model is not a good few-shot information extractor, but a good reranker for hard samples! In: Bouamor, H., Pino, J., Bali, K. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pp. 10572–10601 (2023). https://doi.org/10.18653/v1/2023.findings-emnlp.710
Jeblick, K., Schachtner, B., Dexl, J., Mittermeier, A., Stüber, A.T., Topalis, J., Weber, T., Wesp, P., Sabel, B.O., Ricke, J., Ingrisch, M.: Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports. CoRR (2022). https://doi.org/10.48550/arXiv.2212.14882
Tan, Y., Min, D., Li, Y., Li, W., Hu, N., Chen, Y., Qi, G.: Evaluation of chatgpt as a question answering system for answering complex questions. CoRR (2023). https://doi.org/10.48550/arXiv.2303.07992
Jiao, W., Wang, W., Huang, J., Wang, X., Tu, Z.: Is chatgpt A good translator? A preliminary study. CoRR (2023). https://doi.org/10.48550/arXiv.2301.08745
Kasai, J., Kasai, Y., Sakaguchi, K., Yamada, Y., Radev, D.: Evaluating GPT-4 and chatgpt on japanese medical licensing examinations. CoRR (2023). https://doi.org/10.48550/ARXIV.2303.18027
Sifatkaur, Singh, M., B, V.S., Malviya, N.: Mind meets machine: Unravelling gpt-4’s cognitive psychology. CoRR (2023). https://doi.org/10.48550/arXiv.2303.11436
Nunes, D., Primi, R., Pires, R., Alencar Lotufo, R., Nogueira, R.F.: Evaluating GPT-3.5 and GPT-4 models on brazilian university admission exams. CoRR (2023). https://doi.org/10.48550/arXiv.2303.17003
Lyu, Q., Tan, J., Zapadka, M.E., Ponnatapuram, J., Niu, C., Wang, G., Whitlow, C.T.: Translating radiology reports into plain language using chatgpt and GPT-4 with prompt learning: Promising results, limitations, and potential. Vis. Comput. Ind. Biomed. Art 6, 9 (2023). https://doi.org/10.1186/s42492-023-00136-5
Li, D., Tan, Z., Chen, T., Liu, H.: Contextualization distillation from large language model for knowledge graph completion. In: Graham, Y., Purver, M. (eds.) Findings of the Association for Computational Linguistics: EACL 2024, St. Julian’s, Malta, March 17-22, 2024, pp. 458–477 (2024). https://aclanthology.org/2024.findings-eacl.32
Li, F., Lin, Z., Zhang, M., Ji, D.: A span-based model for joint overlapped and discontinuous named entity recognition. In: Zong, C., Xia, F., Li, W., Navigli, R. (eds.) Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 4814–4828 (2021). https://doi.org/10.18653/v1/2021.acl-long.372
Zhou, W., Zhang, S., Gu, Y., Chen, M., Poon, H.: Universalner: Targeted distillation from large language models for open named entity recognition. In: The Twelfth International Conference on Learning Representations, ICLR 2024 (2024). https://openreview.net/forum?id=r65xfUb76p
Jiang, P., Lin, J., Wang, Z., Sun, J., Han, J.: GenRES: Rethinking evaluation for generative relation extraction in the era of large language models. In: Duh, K., Gomez, H., Bethard, S. (eds.) Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 2820–2837. Association for Computational Linguistics, Mexico City, Mexico (2024). https://aclanthology.org/2024.naacl-long.155
Wang, L., Zhao, W., Wei, Z., Liu, J.: Simkgc: Simple contrastive knowledge graph completion with pre-trained language models. In: Muresan, S., Nakov, P., Villavicencio, A. (eds.) Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp. 4281–4294 (2022). https://doi.org/10.18653/v1/2022.acl-long.295
Li, D., Zhu, B., Yang, S., Xu, K., Yi, M., He, Y., Wang, H.: Multi-task pre-training language model for semantic network completion. ACM Trans. Asian Low Resour. Lang. Inf. Process. 22(11), 250:1–250:20 (2023). https://doi.org/10.1145/3627704
Shu, D., Chen, T., Jin, M., Zhang, Y., Zhang, C., Du, M., Zhang, Y.: Knowledge graph large language model (KG-LLM) for link prediction. CoRR (2024). https://doi.org/10.48550/ARXIV.2403.07311
Hao, S., Tan, B., Tang, K., Ni, B., Shao, X., Zhang, H., Xing, E.P., Hu, Z.: Bertnet: Harvesting knowledge graphs with arbitrary relations from pretrained language models. In: Rogers, A., Boyd-Graber, J.L., Okazaki, N. (eds.) Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 5000–5015 (2023). https://doi.org/10.18653/v1/2023.findings-acl.309
Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P.S.H., Bakhtin, A., Wu, Y., Miller, A.H.: Language models as knowledge bases? In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp. 2463–2473 (2019). https://doi.org/10.18653/v1/D19-1250
AlKhamissi, B., Li, M., Celikyilmaz, A., Diab, M.T., Ghazvininejad, M.: A review on language models as knowledge bases. CoRR (2022). https://doi.org/10.48550/ARXIV.2204.06031
West, P., Bhagavatula, C., Hessel, J., Hwang, J.D., Jiang, L., Bras, R.L., Lu, X., Welleck, S., Choi, Y.: Symbolic knowledge distillation: from general language models to commonsense models. In: Carpuat, M., Marneffe, M., Ruíz, I.V.M. (eds.) Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pp. 4602–4625 (2022). https://doi.org/10.18653/v1/2022.naacl-main.341
Luo, L., Ju, J., Xiong, B., Li, Y., Haffari, G., Pan, S.: Chatrule: Mining logical rules with large language models for knowledge graph reasoning. CoRR (2023). https://doi.org/10.48550/ARXIV.2309.01538
Miller, A.H., Fisch, A., Dodge, J., Karimi, A., Bordes, A., Weston, J.: Key-value memory networks for directly reading documents. In: Su, J., Carreras, X., Duh, K. (eds.) Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 1400–1409 (2016). https://doi.org/10.18653/v1/d16-1147
Funding
No funding was received to assist with the preparation of our work.
Author information
Contributions
Yuqi Zhu, Xiaohan Wang and Jing Chen wrote the main manuscript text. All authors reviewed the manuscript.
Ethics declarations
Ethics approval and consent to participate
This work did not involve any human participants, their data, or biological materials, and therefore did not require ethical approval.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Neuro-Symbolic Intelligence: Large Language Model Enabled Knowledge Engineering
Guest Editors: Haofen Wang, Arijit Khan, Jun Liu and Michael Witbrock
Appendices
Appendix A: Related work
A.1 Large language models
LLMs are pre-trained on substantial amounts of textual data and have become a significant component of contemporary NLP research. Recent advancements in NLP have led to the development of highly capable LLMs, such as GPT-3 [60], ChatGPT, and GPT-4, which exhibit exceptional performance across a diverse array of NLP tasks, including machine translation, text summarization, and question answering. Concurrently, several previous studies have indicated that LLMs can achieve remarkable results on relevant downstream tasks with minimal or even no demonstrations in the prompt [25, 61,62,63,64]. Sánchez et al. [65] proposes a novel neural language model that incorporates inductive biases to enforce explicit relational structures in the representations of pretrained language models. This provides further evidence of the robustness and generality of LLMs.
A.2 ChatGPT & GPT-4
ChatGPT, an advanced LLM developed by OpenAI, is primarily designed for engaging in human-like conversations. During fine-tuning, ChatGPT utilizes reinforcement learning from human feedback (RLHF) [33], thereby enhancing its alignment with human preferences and values.
As a cutting-edge large language model developed by OpenAI, GPT-4 builds upon the successes of its predecessors such as GPT-3 and ChatGPT. Trained on an unparalleled scale of computation and data, it exhibits remarkable generalization, inference, and problem-solving capabilities across diverse domains. In addition, as a large-scale multimodal model, GPT-4 can process both image and text inputs. In general, the public release of GPT-4 offers fresh insights into the future advancement of LLMs and presents novel opportunities and challenges within the realm of NLP.
With the popularity of LLMs, an increasing number of researchers are exploring the specific emergent capabilities and advantages they possess [66]. Bang et al. [62] performs an in-depth analysis of ChatGPT along multitask, multilingual, and multimodal dimensions. The findings indicate that ChatGPT excels at zero-shot learning across various tasks, even outperforming fine-tuned models in certain cases, but faces challenges when generalizing to low-resource languages. Furthermore, in terms of multimodality, the capabilities of ChatGPT remain basic compared to more advanced vision-language models. Moreover, ChatGPT has garnered considerable attention in various other domains, including information extraction [25, 53], reasoning [29], text summarization [67], question answering [68] and machine translation [69], showcasing its versatility and applicability in the broader field of natural language processing.
While there is a growing body of research on ChatGPT, investigations into GPT-4 continue to be relatively limited. Nori et al. [63] conducts an extensive assessment of GPT-4 on medical competency examinations and benchmark datasets and shows that GPT-4, without any specialized prompt crafting, surpasses the passing score by over 20 points. Kasai et al. [70] also studies GPT-4’s performance on the Japanese national medical licensing examinations. Furthermore, there are some studies on GPT-4 that focus on cognitive psychology [71], academic exams [72], and translation of radiology reports [73].
A.3 LLMs for KG
Many studies now leverage large language models to facilitate the construction of knowledge graphs [74]. Some of these efforts focus on specific subtasks within KG construction. For instance, LLMs are utilized for named entity recognition and classification [75, 76], leveraging their contextual understanding and linguistic knowledge. Furthermore, LLMs have also demonstrated utility in tasks such as relation extraction [25, 77] and link prediction [78,79,80]. In line with our approach, several studies have explored the use of LLMs as knowledge bases [74, 81,82,83] to support KG construction. For example, some researchers [84] propose a symbolic knowledge distillation framework that extracts symbolic knowledge from LLMs: they first extract commonsense facts from large LLMs like GPT-3, fine-tune smaller student LLMs on these facts, and then use the student models to generate KGs. Concurrently, ChatRule [85] uses LLMs to mine logical rules from KGs, addressing the computational intensity and scalability issues of existing methods: it generates rules with LLMs by integrating the semantic and structural information of KGs, and employs a rule ranking module to evaluate rule quality. These studies highlight the extensive potential of LLMs in KG construction, promoting the automation and intelligent development of this field.
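To make this paradigm concrete, the sketch below shows how an LLM can be prompted to emit schema-constrained triples from raw text. It is illustrative only: the prompt wording, the toy relation schema, and the choice of the OpenAI chat API are assumptions, not the exact pipeline of any cited work.

    # Minimal sketch of LLM-based triple extraction for KG construction.
    # Prompt wording, relation schema, and model choice are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SCHEMA_RELATIONS = ["founded-by", "located-in", "works-for"]  # hypothetical schema

    def extract_triples(sentence: str) -> str:
        prompt = (
            "Extract knowledge triples from the sentence below.\n"
            f"Allowed relations: {', '.join(SCHEMA_RELATIONS)}.\n"
            "Answer with one (head, relation, tail) triple per line.\n\n"
            f"Sentence: {sentence}"
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic decoding for reproducible evaluation
        )
        return response.choices[0].message.content

    print(extract_triples("Steve Jobs founded Apple in Cupertino."))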
Appendix B: Datasets
Entity, Relation and Event Extraction DuIE2.0 [39] is a substantial Chinese relation extraction dataset with more than 210,000 sentences and 48 predefined relation categories. SciERC [40] is a collection of scientific abstracts annotated with seven relation types. Re-TACRED [41], an upgraded version of the TACRED dataset, includes over 91,000 sentences across 40 relations. MAVEN [42] is a general-domain event detection benchmark containing 4,480 documents and 168 event types.
Link Prediction FB15K-237 [43] is widely used as a benchmark for assessing the performance of knowledge graph embedding models on link prediction, encompassing 237 relations and 14,541 entities. ATOMIC 2020 [44] serves as a comprehensive commonsense repository with 1.33 million inferential knowledge tuples about entities and events.
Question Answering FreebaseQA [45] is an open-domain QA dataset built on the Freebase knowledge graph, comprising question-answer pairs drawn from various sources. MetaQA [16], expanded from WikiMovies [86], provides a substantial collection of single-hop and multi-hop question-answer pairs, surpassing 400,000 in total.
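For link prediction, a test triple with a masked tail can be cast as a short textual query for the LLM. The following sketch is purely illustrative; the template wording is an assumption rather than the exact prompt used in the experiments.

    # Illustrative sketch: casting a link-prediction instance as a textual prompt.
    def link_prediction_prompt(head: str, relation: str) -> str:
        return (
            "Predict the tail entity for the incomplete knowledge graph triple.\n"
            f"Triple: ({head}, {relation}, ?)\n"
            "Answer with the entity name only."
        )

    # Example query in the style of FB15K-237
    print(link_prediction_prompt("Steven Spielberg", "/film/director/film"))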
Appendix C: Data collection of VINE
Because GPT-4’s training data extends only through September 2021, we select a portion of participants’ responses from two competitions organized by the New York Times as our data sources: the “February Vocabulary Challenge: Invent a Word” held in January 2022 and the “Student Vocabulary Challenge: Invent a Word” conducted in February 2023. Both competitions aim to promote the creation of distinctive and memorable new words that address gaps in the English language.
Our constructed dataset includes 1,400 sentences, 39 novel relations, and 786 unique entities. During construction, we ensure that each relation type has a minimum of 10 associated samples to facilitate subsequent experiments. Notably, we find that in the Re-TACRED test set, certain relation types have fewer than 10 corresponding data instances; to offset this deficiency, we select sentences of the corresponding types from the training set (a sketch of this top-up step is given below).
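The top-up step admits a simple implementation. The sketch below assumes each sample is a dict with a "relation" field; the variable names and data layout are illustrative, not the exact code used.

    # Illustrative sketch of topping up under-represented relation types
    # with same-type sentences drawn from the training set.
    from collections import defaultdict

    MIN_SAMPLES = 10  # minimum samples required per relation type

    def top_up(test_set: list[dict], train_set: list[dict]) -> list[dict]:
        by_relation = defaultdict(list)
        for sample in test_set:
            by_relation[sample["relation"]].append(sample)

        train_pool = defaultdict(list)
        for sample in train_set:
            train_pool[sample["relation"]].append(sample)

        for relation, samples in by_relation.items():
            deficit = MIN_SAMPLES - len(samples)
            if deficit > 0:
                # borrow sentences of the same relation type from the training set
                samples.extend(train_pool[relation][:deficit])

        return [s for samples in by_relation.values() for s in samples]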
Appendix D: Prompts for evaluation
Here we list the prompts used for each task during the experiments; their general structure is sketched below.
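While the verbatim prompts are available in the released materials, their general shape can be summarized as a task instruction, optional in-context demonstrations, and the test input. The assembly sketch below is illustrative; the instruction text and demonstration format are assumptions.

    # Illustrative few-shot prompt assembly (zero demonstrations = zero-shot).
    def build_prompt(instruction: str,
                     demonstrations: list[tuple[str, str]],
                     test_input: str) -> str:
        parts = [instruction]
        for text, answer in demonstrations:
            parts.append(f"Input: {text}\nOutput: {answer}")
        parts.append(f"Input: {test_input}\nOutput:")
        return "\n\n".join(parts)

    prompt = build_prompt(
        "Identify the relation between the marked entities in the sentence.",
        [("Bill Gates founded Microsoft.", "founded-by")],
        "Ada Lovelace collaborated with Charles Babbage.",
    )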
Appendix E: Prompts for virtual knowledge extraction
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhu, Y., Wang, X., Chen, J. et al. LLMs for knowledge graph construction and reasoning: recent capabilities and future opportunities. World Wide Web 27, 58 (2024). https://doi.org/10.1007/s11280-024-01297-w