
An Automatic Question Usability Evaluation Toolkit

  • Conference paper
Artificial Intelligence in Education (AIED 2024)

Abstract

Evaluating multiple-choice questions (MCQs) involves either labor-intensive human assessments or automated methods that prioritize readability, often overlooking deeper question design flaws. To address this issue, we introduce the Scalable Automatic Question Usability Evaluation Toolkit (SAQUET), an open-source tool that leverages the Item-Writing Flaws (IWF) rubric for a comprehensive and automated quality evaluation of MCQs. By harnessing the latest large language models, such as GPT-4, advanced word embeddings, and Transformers designed to analyze textual complexity, SAQUET effectively pinpoints and assesses a wide array of flaws in MCQs. We first demonstrate the discrepancy between commonly used automated evaluation metrics and human assessments of MCQ quality. We then evaluate SAQUET on a diverse dataset of MCQs across five domains (Chemistry, Statistics, Computer Science, Humanities, and Healthcare), showing how it effectively distinguishes between flawed and flawless questions and provides a level of analysis beyond what is achievable with traditional metrics. With an accuracy of over 94% in detecting the presence of flaws identified by human evaluators, our findings highlight the limitations of existing evaluation methods and showcase the toolkit's potential to improve the quality of educational assessments.
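The core idea described in the abstract is pairing an LLM with the Item-Writing Flaws rubric to flag design problems in an MCQ. The sketch below illustrates that general idea only; it is not the SAQUET implementation (the authors' code is linked in the Notes), and the flaw list, prompt wording, and function name are illustrative assumptions.

# Hypothetical sketch of LLM-based IWF checking; NOT the SAQUET implementation.
# Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

# Illustrative subset of Item-Writing Flaw criteria (wording assumed, not taken from the paper).
IWF_CRITERIA = [
    "Implausible distractors",
    "Longest option is the correct answer",
    "Negatively worded stem",
    "Use of 'all of the above' or 'none of the above'",
    "Grammatical cues between the stem and the correct answer",
]

def check_iwf(stem: str, options: list[str], answer: str) -> str:
    """Ask the model which, if any, of the listed flaws the MCQ exhibits."""
    client = OpenAI()
    # Format the question with lettered options (A., B., ...).
    question = stem + "\n" + "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)
    )
    prompt = (
        "You are reviewing a multiple-choice question for item-writing flaws.\n"
        f"Question:\n{question}\nCorrect answer: {answer}\n\n"
        "For each flaw below, reply 'present' or 'absent' with a one-line reason:\n"
        + "\n".join(f"- {c}" for c in IWF_CRITERIA)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic-leaning output for evaluation use
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(check_iwf(
        "Which gas do plants primarily absorb during photosynthesis?",
        ["Carbon dioxide", "Oxygen", "Nitrogen", "All of the above"],
        "Carbon dioxide",
    ))

In practice, a toolkit of this kind would combine such rubric prompts with rule-based checks, embeddings, and readability features, and validate the outputs against human annotations, as the paper reports.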


Notes

  1. https://github.com/StevenJamesMoore/AIED24

  2. https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo


Author information

Corresponding author

Correspondence to Steven Moore.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Moore, S., Costello, E., Nguyen, H.A., Stamper, J. (2024). An Automatic Question Usability Evaluation Toolkit. In: Olney, A.M., Chounta, I.A., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds) Artificial Intelligence in Education. AIED 2024. Lecture Notes in Computer Science, vol 14830. Springer, Cham. https://doi.org/10.1007/978-3-031-64299-9_3


  • DOI: https://doi.org/10.1007/978-3-031-64299-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-64298-2

  • Online ISBN: 978-3-031-64299-9

  • eBook Packages: Computer Science, Computer Science (R0)

