Abstract
Evaluating multiple-choice questions (MCQs) involves either labor-intensive human assessments or automated methods that prioritize readability, often overlooking deeper question design flaws. To address this issue, we introduce the Scalable Automatic Question Usability Evaluation Toolkit (SAQUET), an open-source tool that leverages the Item-Writing Flaws (IWF) rubric for a comprehensive and automated quality evaluation of MCQs. By harnessing the latest large language models such as GPT-4, advanced word embeddings, and Transformers designed to analyze textual complexity, SAQUET effectively pinpoints and assesses a wide array of flaws in MCQs. We first demonstrate the discrepancy between commonly used automated evaluation metrics and human assessments of MCQ quality. We then evaluate SAQUET on a diverse dataset of MCQs spanning five domains: Chemistry, Statistics, Computer Science, Humanities, and Healthcare, showing how it effectively distinguishes between flawed and flawless questions and provides a level of analysis beyond what traditional metrics can achieve. SAQUET detects the presence of flaws identified by human evaluators with an accuracy of over 94%; these findings emphasize the limitations of existing evaluation methods and showcase the toolkit's potential to improve the quality of educational assessments.
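To make the pipeline described above concrete, the following is a minimal, illustrative sketch of how an IWF-style check on a single MCQ might combine a simple rule-based test with a GPT-4 prompt. It assumes the OpenAI Python SDK and an API key in the environment; the function names, rubric subset, and prompt wording are hypothetical and are not taken from SAQUET's actual implementation.

```python
# Illustrative sketch only: a minimal IWF-style flaw check for one MCQ,
# assuming an OpenAI-style chat API. The rubric subset, prompt wording,
# and helper names are hypothetical and not drawn from SAQUET itself.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# A small, hypothetical subset of Item-Writing Flaw criteria.
IWF_CRITERIA = [
    "Stem is negatively worded without emphasis (e.g., 'not', 'except')",
    "Options include 'all of the above' or 'none of the above'",
    "The correct answer is noticeably longer than the distractors",
    "Distractors are implausible or grammatically inconsistent with the stem",
]

def simple_rule_checks(stem: str, options: list[str]) -> list[str]:
    """Cheap lexical checks that do not require a language model."""
    flags = []
    if any(o.strip().lower() in {"all of the above", "none of the above"}
           for o in options):
        flags.append("Options include 'all of the above' or 'none of the above'")
    return flags

def llm_flaw_check(stem: str, options: list[str], answer: str) -> list[str]:
    """Ask GPT-4 which rubric criteria the question violates."""
    prompt = (
        "You are reviewing a multiple-choice question for item-writing flaws.\n"
        f"Stem: {stem}\nOptions: {options}\nCorrect answer: {answer}\n\n"
        "Return a JSON list containing only the criteria below that are "
        "violated:\n" + "\n".join(f"- {c}" for c in IWF_CRITERIA)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return []  # if the reply is not valid JSON, report no LLM-detected flaws

if __name__ == "__main__":
    stem = "Which of the following is NOT a noble gas?"
    options = ["Helium", "Neon", "Oxygen", "All of the above"]
    flaws = set(simple_rule_checks(stem, options)) | set(
        llm_flaw_check(stem, options, answer="Oxygen"))
    print("Detected flaws:", sorted(flaws))
```

In this sketch, inexpensive rule-based checks run first and the language model is consulted only for criteria that require semantic judgment, mirroring the hybrid rule-based/LLM approach the abstract describes at a high level.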