Abstract
Evaluating multiple-choice questions (MCQs) involves either labor-intensive human assessments or automated methods that prioritize readability, often overlooking deeper question design flaws. To address this issue, we introduce the Scalable Automatic Question Usability Evaluation Toolkit (SAQUET), an open-source tool that leverages the Item-Writing Flaws (IWF) rubric for a comprehensive and automated quality evaluation of MCQs. By harnessing the latest large language models such as GPT-4, advanced word embeddings, and Transformers designed to analyze textual complexity, SAQUET effectively pinpoints and assesses a wide array of flaws in MCQs. We first demonstrate the discrepancy between commonly used automated evaluation metrics and human assessments of MCQ quality. We then evaluate SAQUET on a diverse dataset of MCQs spanning five domains: Chemistry, Statistics, Computer Science, Humanities, and Healthcare, showing how it effectively distinguishes between flawed and flawless questions and provides a level of analysis beyond what traditional metrics can achieve. SAQUET detects the presence of flaws identified by human evaluators with an accuracy of over 94%; these findings emphasize the limitations of existing evaluation methods and showcase the toolkit's potential to improve the quality of educational assessments.
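To make the pipeline described above concrete, the following is a minimal, illustrative sketch of how an IWF-style check on a single MCQ might combine a simple rule-based test with a GPT-4 prompt. It assumes the OpenAI Python SDK and an API key in the environment; the function names, rubric subset, and prompt wording are hypothetical and are not taken from SAQUET's actual implementation.

```python
# Illustrative sketch only: a minimal IWF-style flaw check for one MCQ,
# assuming an OpenAI-style chat API. The rubric subset, prompt wording,
# and helper names are hypothetical and not drawn from SAQUET itself.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# A small, hypothetical subset of Item-Writing Flaw criteria.
IWF_CRITERIA = [
    "Stem is negatively worded without emphasis (e.g., 'not', 'except')",
    "Options include 'all of the above' or 'none of the above'",
    "The correct answer is noticeably longer than the distractors",
    "Distractors are implausible or grammatically inconsistent with the stem",
]

def simple_rule_checks(stem: str, options: list[str]) -> list[str]:
    """Cheap lexical checks that do not require a language model."""
    flags = []
    if any(o.strip().lower() in {"all of the above", "none of the above"}
           for o in options):
        flags.append("Options include 'all of the above' or 'none of the above'")
    return flags

def llm_flaw_check(stem: str, options: list[str], answer: str) -> list[str]:
    """Ask GPT-4 which rubric criteria the question violates."""
    prompt = (
        "You are reviewing a multiple-choice question for item-writing flaws.\n"
        f"Stem: {stem}\nOptions: {options}\nCorrect answer: {answer}\n\n"
        "Return a JSON list containing only the criteria below that are "
        "violated:\n" + "\n".join(f"- {c}" for c in IWF_CRITERIA)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return []  # if the reply is not valid JSON, report no LLM-detected flaws

if __name__ == "__main__":
    stem = "Which of the following is NOT a noble gas?"
    options = ["Helium", "Neon", "Oxygen", "All of the above"]
    flaws = set(simple_rule_checks(stem, options)) | set(
        llm_flaw_check(stem, options, answer="Oxygen"))
    print("Detected flaws:", sorted(flaws))
```

In this sketch, inexpensive rule-based checks run first and the language model is consulted only for criteria that require semantic judgment, mirroring the hybrid rule-based/LLM approach the abstract describes at a high level.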