Leveraging Large Language Models To Generate Course-Specific Semantically Annotated Learning Objects
KEYWORDS
automated question generation, computer science, generative AI,
GPT-4, large language models, retrieval-augmented generation
1 | INTRODUCTION
The determinants of learning success have been extensively researched in various disciplines [1, 2]. Numerous em-
pirical studies support the hypothesis that tailoring learning materials to learners' needs significantly increases the effectiveness of learning [3, 4]. Bloom's mastery learning theory [5] and Keller's personalized system of instruction [6] postulate teaching methodologies that rely heavily on individualized learning materials and fine-grained assessment. However, resource limitations and an increasingly diverse educational
landscape challenge teachers and content creators in general.
Adaptive learning systems like ALEA [7] promise to address this need by delivering Learning Objects (LOs), such
as definitions, examples, or questions tailored to individual learners’ specific prior knowledge, competencies, and
preferences. One way of realizing this is by using semantically annotated LOs that allow for determining their pre-
requisites, which concepts they address, and which competencies they intend to foster. In conjunction with a model
of a learner’s current knowledge and competencies, this allows for many valuable services, such as the automatic
generation of flashcards or so-called guided tours: individually selected sequences of LOs (“learning paths”) that learners can create on demand with a single click to open up a specific concept in a specific cognitive dimension. To make this possible, the pool of semantically annotated LOs must be just as diverse as the educational biographies of the learners. However, developing sufficiently annotated LOs requires huge amounts of manual labor and expertise. This bottleneck underlines the need for innovative solutions that generate customized learning materials automatically, which has become a focus of current research in computer science education. Attempts to (partially)
automate the creation of quiz questions can be found in the literature as early as the 1970s [8].
Studies on the role of artificial intelligence in education point out that a significant hurdle is the deficiency of
suitable learning materials for individualized and adaptable learning [9], a problem that may be mitigated by the ca-
pabilities of large language models (LLMs). Research in computing education is exploring the capabilities of these
models to generate educational content that is both contextually appropriate and educationally demanding, offering
a promising solution to the constraints associated with manually creating content. However, the effectiveness of LLM-generated educational content is often limited by its lack of focus on specific course content and on the learner's current understanding. The generated LOs often do not fit within the intended educational framework or meet the
diverse needs of individual learners.
This paper presents the results of experiments on generating semantically annotated quiz questions using a state-
of-the-art LLM. In particular, we investigate to what extent LLMs can be used to generate questions in the domain of university-level computer science (CS) that (1) are didactically valuable, (2) are sufficiently annotated to allow for the above selection process, (3) can ideally be graded by a software system (e.g., multiple choice or fill-in-the-blanks questions) so that the accuracy of the system's learner models increases automatically and immediately, and (4) are entirely situated in the context of a particular university-level course with respect to the terminology and notation used. Unlike
earlier similar studies [10, 11], our task requires extensive context, rendering it unsuitable for naive approaches, e.g.,
using ChatGPT. Instead, we use more targeted techniques beyond “prompt engineering”, such as retrieval-augmented
generation (RAG).
Overview Section 4 details our experimental methodology, including the considerations behind our choice of seman-
tic annotations (Section 4.1), the selection criteria for the LLM (Section 4.2), the design of our question generation
pipeline (Section 4.3), and the evaluation framework used (Section 4.5), including the criteria for assessing question quality and an
expert-based survey methodology. The results of our experiments are presented and discussed in Section 5. Finally,
Section 6 concludes the paper with a summary of our findings, implications for future research in automated ques-
tion generation using LLMs, and potential pathways for enhancing the educational value of LLM-generated content
in adaptive learning environments.
2 | RELATED WORK
Over the past few decades, the process and methodology of automated question generation (AQG) have under-
gone significant transformations, driven primarily by advancements in computational linguistics and artificial intelli-
gence (AI). In the earlier stages, AQG relied heavily on rule-based systems that applied predefined templates and
linguistic patterns to generate questions from text, requiring extensive manual crafting and domain-specific adjust-
ments. Early studies of this kind date back to Wolfe [8].
The development of deep learning and neural network models brought a transition to more sophisticated, context-aware systems. These models, particularly sequence-to-sequence and transformer-based archi-
tectures, have enabled the generation of more nuanced, relevant, and diverse questions by “understanding” deeper
semantic relationships within the text. Kurdi et al. conducted a systematic review of empirical research focused on
addressing the issue of AQG within educational contexts [12]. They thoroughly outlined various methods of genera-
tion, tasks, and evaluation techniques found in studies between 2015 and early 2019. A standard method involves
identifying sentences in the text sources with high information content using topic words or key phrases. The system
then selects a keyword as the answer key, removes it to form a question stem, and generates incorrect choices (dis-
tractors) using a clustering method without external data. These systems mainly produce questions that test remembering factual knowledge rather than understanding, which is a key focus of the work presented in this paper. The reviewed results also suggest that large language models (LLMs) can benefit from semantics-based approaches to generate meaningful questions that are closely related to the source content.
Recent progress in generative natural language models has opened up new potential in the generation of educational content [13, 14, 15, 16]. In recent years, more and more approaches have emerged that generate tasks in a single step using LLMs like GPT-3 instead of dividing the AQG process into several sub-tasks. Yan et al. conducted a systematic scoping review of articles published since 2017 to pinpoint the current state of research on using LLMs [17]. They
identified content generation, including multiple-choice questions and feedback generation, as primary educational
tasks that research aims to automate. McNichols et al. tested the effectiveness of LLM-based methods for automatic distractor and feedback generation using a real-world dataset of authentic student responses [18]. Their findings show that fine-tuning was ineffective and that approaches other than LLM prompting need to be explored. Dijkstra
et al. [19] developed a quiz generator based on OpenAI’s GPT-3 model, fine-tuned on text-quiz pairs to generate
complete multiple-choice questions, with correct answers and distractors. They noted that while most of the gen-
erated questions were of acceptable quality, creating high-quality distractors was more challenging than generating
question-and-answer pairs.
More specifically, LLM-based approaches have recently been applied in programming education (for an extensive
literature review, see [13]). Sarsa et al. [11] explored the capabilities of OpenAI’s LLM Codex to generate program-
ming exercises and code explanations. Their results show that the generated questions and explanations were novel,
sensible, and, in some cases, ready to use. Tran et al. [20] evaluated the capabilities of OpenAI’s GPT-4 and GPT-3
models to generate isomorphic multiple choice questions (MCQs) based on MCQ-stems from a question bank and
an introductory computing course. Their findings underscore that the newer generation of LLMs outperforms older
generative models in AQG.
Research Gap
Reviewing the current landscape of research in AQG, several critical gaps emerge that necessitate further investiga-
tion. Current research shows that methods for generating closed-format tasks that target remembering factual knowledge work well. However, generating tasks that target deeper understanding, such as those
that require comprehension or analytical skills, remains a significant challenge. We also could not find any research
on generating questions containing semantic annotations. While it is already known that LLMs have the potential for
generating feedback, and in particular code explanations, no work attempts to generate questions using LLMs that go
beyond introductory courses in computing education. Kurdi et al. [12] point out that there is little focus on controlling
question parameters like difficulty and generating feedback. Since we want to incorporate the generated questions
into an adaptive learning assistant, we have carefully considered which parameters are relevant for adaptive learning
– such as difficulty, cognitive dimension, and prior knowledge – and want to explore how these can be integrated
into the question generation process. We also did not find any studies that investigated generating questions using advanced LLM approaches such as RAG, even though such techniques are crucial for scaling beyond the original training data of the model.
Additionally, our goal is to utilize the generated questions in an adaptive learning assistant that is capable of auto-
matically selecting suitable questions based on the associated learning objective and a particular student’s (estimated)
prior knowledge in the form of a learner model. Therefore, we need to annotate the questions with the relevant
information for this selection process, namely the learning objectives they address and their prerequisites, each given as a pair of a concept and a cognitive dimension (see Section 4.1).
While providing these annotations is not particularly difficult for an experienced user, it is time-consuming and potentially
automatable. Therefore, we also investigate the extent to which a large language model can directly generate fully
annotated questions.
Figure 1 shows an example of a fully annotated multiple-choice question created by hand in LATEX following the
Y-model framework [22]. The relational semantic annotations allow the learner to, e.g., hover over concepts in the
text (after conversion to HTML) and display definitions (this functionality can, of course, be deactivated if desired). A
detailed description of the specific annotation schema can be found in Section 4.1.
Ensuring the quality of the generated tasks is a primary goal. As there is no standard metric in the literature
for measuring the quality of questions [23], we reviewed metric lists from literature reviews on automatic question
generation [24, 23] to identify relevant criteria that determine the quality of the questions we generate, namely the correctness of the content, the fit to the course material, solvability using the provided material, clarity and unambiguity of the formulation, relevance for the intended learning objective, and an appropriate level of difficulty.
Additionally, feedback is one of the most influential factors for learning success [25], and recent experiments
show the potential of large language models for generating feedback in programming education [26, 27, 28, 29, 30].
A further goal is to generate questions containing feedback that helps learners understand why a given answer is
wrong.
The research questions are as follows:
(RQ1) To what extent can LLMs be used to generate university-suitable autogradable questions in CS education?
(RQ2) To what extent can LLMs be used to annotate these questions semantically?
4 | METHODOLOGY
4.1 | Semantic Annotations
To semantically annotate learning objects – including questions – we use the STEX package [31] for LATEX. STEX uses
an ontology based on OMDoc [32]: Concepts are represented as symbols that can be introduced via the \symdecl-
macro and can be related to each other in various ways. Symbols are always declared in modules (via the smodule-
environment), which can import (the symbols of) other modules, adding another layer of relations and allowing for
sharing concepts among large collections of disparate documents, thus enabling the collaborative and modular devel-
opment of domains of knowledge as highly interrelated knowledge graphs independent of, and across, presentational
context: symbols can have arbitrarily many definitions, formulations thereof, and different notations, to accommodate authors' preferences without duplicating the domain knowledge itself.
Symbols can be referenced in various ways; most importantly, the \symref-macro allows annotating arbitrary
text as representing a particular symbol, and formal notations can be produced via dedicated semantic macros that
additionally associate the notation with the corresponding symbol. Since the details are largely irrelevant to the topic
of this paper, we refer to [33] for details and Section 4.4 for examples.
More importantly, for our purposes, STEX also bundles the problem-package, providing dedicated markup macros
and environments for various variants of quiz questions. Most importantly, it provides the sproblem-environment
to annotate questions, within which we can use the mcb and scb environments for multiple and single choice blocks,
respectively. As a third (autogradable) question type, the \fillinsol macro can be used to generate a blank box for
fill-in-the-blanks questions. In all three cases, we can mark (or provide) the correct answer(s), add feedback text to
be displayed if a particular answer (correct or wrong) is chosen, and specify grading actions (i.e., set, add or deduct
points) depending on the answers given. Additionally, we can specify the learning objectives and prerequisites as pairs
of a symbol and one of the keywords remember, understand, apply, analyze, evaluate, create. Symbols refer-
enced in the body of the question are automatically determined to be prerequisites with the cognitive dimension
remember. For example, if a module declares the symbol plus for addition on integers, then the question “what is 2 \symref{plus}{added to} 2?” automatically has the prerequisite (remember, plus). We refer to such annotations as
relational annotations and to those that do not relate text to some symbol (e.g. fillinsol or sproblem) as structural
annotations.
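To illustrate how such annotations can be consumed programmatically, the following Python sketch extracts objective and prerequisite pairs from a question body. It is a minimal illustration that assumes only the simple \objective{...}{...} and \symref{...}{...} patterns shown above; it is not the actual STEX/ALEA implementation, which operates on the full underlying ontology.
```
import re

# Illustrative only (not the actual sTeX/ALeA implementation): collect the
# explicitly declared objectives and derive prerequisites from \symref uses,
# each prerequisite getting the cognitive dimension "remember" as described above.
OBJECTIVE = re.compile(r"\\objective\{([^}]*)\}\{([^}]*)\}")
SYMREF = re.compile(r"\\symref\{([^}]*)\}\s*\{[^}]*\}")

def annotations(stex_source: str):
    objectives = {(dim, sym) for dim, sym in OBJECTIVE.findall(stex_source)}
    prerequisites = {("remember", sym) for sym in SYMREF.findall(stex_source)}
    return objectives, prerequisites

question = r"""\begin{sproblem}
\objective{understand}{plus}
What is $2$ \symref{plus}{added to} $2$?
\end{sproblem}"""

print(annotations(question))
# -> ({('understand', 'plus')}, {('remember', 'plus')})
```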
Besides being compilable to PDF via LATEX, we can use the RUSTEX system [34] to convert the STEX documents to HTML,
preserving the semantic annotations in the form of attributes on the HTML nodes. Additionally, this assigns a globally
unique URI to the document itself and every section, module, symbol, and learning object therein, enabling searching,
querying, and referencing all of these afterward across documents.
Subsequently, we can embed various services that act on the semantic annotations directly into the HTML docu-
ments via JavaScript. Our learning assistant utilizes this to render quiz questions as interactive components, evaluate
answers provided by students, display the appropriate feedback provided, and update a student’s learner model ac-
cordingly (see Figure 1).
While our usage of STEX is primarily motivated by our learning assistant, for the purposes of this paper, it offers
additional advantages: Its underlying ontology, being designed around representing the semantics of mathematical
statements formally, is consequently sufficiently expressive to subsume most semantic annotation systems, in partic-
ular those tailored specifically for quiz questions. Our results should, therefore, (all else being equal) generalize to
other annotation schemes. Furthermore, since STEX inherits its syntax from LATEX, the LLM used does not need to be
fine-tuned or explicitly prompted on the usage of a new or esoteric language – the huge amount of publicly available
LATEX code implies that its syntax is well represented in the training data used for these models, and can therefore be
expected not to pose an issue for generation. Indeed, our experiments did not yield any LLM output with basic syntax
errors.
4.2 | Choice of LLM
Our choice of language model was guided by the following considerations:
1. Targeting understanding rather than remembering factual knowledge implies that the model we use should be
capable of “synthesizing” complex knowledge, excluding, e.g., smaller language models less capable of “reasoning”.
2. Since the knowledge domain and preferred conventions should not be fixed, deliberately training a model on
dedicated course materials is largely not feasible, which calls for few-shot approaches. And
3. since the latter point also entails providing possibly large amounts of course materials online during generation,
the model used needs to be able to process large amounts of text at once. More precisely, it should allow for a
large enough context window – the maximum number of tokens the model can consider in a single prompt/reply
step.
OpenAI’s GPT-4 [35] shows consistently better results across all tasks compared to alternative large language
models.1 In particular, we initially experimented with the free GPT-3.5 model (that powers the free version of ChatGPT), which quickly demonstrated very poor performance both concerning the questions generated and, more broadly,
the ability to follow the instructions provided – meaning more prompt design was unlikely to improve the results sig-
nificantly.
Similarly, most models (including almost all open-source models) primarily focus on short conversations and, there-
fore, only support a relatively small context window. For example, one of the currently most popular open source lan-
guage models, LLaMA [36], can process a total input of 2048 tokens, whereas GPT-3.5 and GPT-4-Turbo have context
windows of up to 16,385 and 128,000 tokens, respectively.2
We consequently opted for the (unfortunately commercial, closed-source, and proprietary) GPT-4-Turbo model as the one holding the most promise with respect to our criteria at the time of the experiments. Unfortunately, this also means we cannot provide access to a public instance of our pipeline since the API used to access the model is
monitored and billed on a per-token basis.3
1 See, e.g., https://toloka.ai/blog/llm-leaderboard/ for a regularly updated comparison across several dimensions
2 It should be mentioned that the precise meaning of “context window” is not necessarily clear. There is some speculation that the huge
context window advertised for GPT-4-Turbo involves neural compression and similar (potentially lossy) techniques to allow for more input
rather than representing the actual size of the underlying transformer model’s input layer.
3 A common problem is people actively crawling the internet for API keys and publicly accessible APIs to GPT models, intending to abuse those
for unintended purposes (e.g., via “prompt hacking”).
4.3 | Question Generation Pipeline
To explain how we use RAG, we give a brief overview of the entire question-generation pipeline. Note that we
deliberately leave room for variation:
• A course instructor selects some concept (represented by an STEX symbol), a cognitive dimension, a document of
course materials (i.e., the entire lecture notes) via its URI, and additional parameters (e.g., number of questions,
difficulty level, types of questions, a brief description of the course topics, ...).
• The system replaces placeholders in a generic “master prompt” (see Section 4.4) by the parameters provided.
• The system then selects those text fragments in the provided course materials that directly relate to the chosen
concept – e.g., definitions, examples, surrounding remarks, etc. – on the basis of the semantic annotations in
the document and appends them to the master prompt. Since STEX symbols are intrinsically relational, we can, in
principle, also add materials for dependent concepts down to some cutoff point. In practice, we currently pick the
entire chapter (or section or subsection) in which the concept is introduced.
• The final prompt is passed to the LLM, and its output is presented to the instructor (a minimal sketch of this pipeline follows below).
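As a concrete illustration, the following Python sketch outlines this pipeline. The retrieval helper select_fragments and all placeholder names except %%CONCEPT%%, %%COURSE%%, and %%LEARNING_OBJECTS%% (see Section 4.4) are hypothetical, and the OpenAI SDK call merely stands in for the API access we use; this is a sketch of the workflow under these assumptions, not our actual implementation.
```
from openai import OpenAI  # assumes the OpenAI Python SDK (>= 1.0)

client = OpenAI()
MASTER_PROMPT = open("master_prompt.txt").read()  # generic prompt with %%...%% placeholders

def select_fragments(notes_uri: str, concept: str) -> list[str]:
    """Hypothetical retrieval step: return the sTeX snippets (definitions, examples,
    surrounding remarks) related to the concept, e.g. the chapter introducing it."""
    raise NotImplementedError

def generate_questions(concept: str, dimension: str, course: str, notes_uri: str,
                       n: int = 5, difficulty: str = "medium") -> str:
    fragments = select_fragments(notes_uri, concept)
    prompt = (MASTER_PROMPT
              .replace("%%CONCEPT%%", concept)
              .replace("%%COGNITIVE_DIMENSION%%", dimension)   # placeholder name assumed
              .replace("%%COURSE%%", course)
              .replace("%%NUMBER%%", str(n))                   # placeholder name assumed
              .replace("%%DIFFICULTY%%", difficulty)           # placeholder name assumed
              .replace("%%LEARNING_OBJECTS%%", "\n\n".join(fragments)))
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content  # sTeX code, shown to the instructor
```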
We focus here on the scenario where all parameters – in particular, the concept, cognitive dimension, and number
of questions – are explicitly fixed in the prompt, as this allows for better evaluation of the results of our experiments.
However, we note that we can easily generalize to broader applications by automatically selecting some of the param-
eters. For example, we can generate questions for a whole chapter along all cognitive dimensions by determining the
concepts introduced therein (which we can do automatically based on semantic annotations) and repeatedly running
the above pipeline with, e.g., randomly varied parameters.
We also chose not to split the task into multiple prompts for specific purposes. For example, we could conceptually
use three distinct prompts (or even distinct models) for 1. generating questions, 2. adding feedback for students, and
3. introducing semantic annotations. This split would be natural in a non-LLM-based approach since all three steps
require distinct methodologies and tools. However, when using the same model for all three steps, it is less clear whether splitting the task is advantageous. We leave this question to future work but note that our initial (and admittedly superficial) experiments in that direction did not yield noticeably different results. Furthermore,
experiments with instructing the model to (paraphrased) first “think of a good question, summarize it, and then state
the question as fully annotated STEX” – conceptually splitting the task into two separate steps – did not seem to make
a clear difference in the quality of the output either. This is to some extent unexpected since it closely corresponds
to the popular “think step by step”-instruction that is considered to generally improve results in LLM prompts related
to reasoning.
4.4 | Prompt Design
Paraphrasing Zamfirescu-Pereira et al. [37], it helps to think of the language model as an interpreter and of prompting as analogous to programming, in the sense that it requires clear and detailed instructions to an entity keen to “do what you say, not what you (clearly) mean”.
Consequently, our final prompt is in many places a reaction to flaws observed in previous iterations, which we will note where appropriate.
Our prompt starts with a brief summary of what we expect the model to do:
We then clarify what we mean by cognitive dimension. Note that the huge amounts of data the model was trained on
also contain many texts about concepts in education; particularly Bloom’s revised taxonomy [21]. Nevertheless, we
prefer to be precise:
We use the term "cognitive dimension" according to Bloom's taxonomy. It is
provided as one of the strings "remember", "understand", "apply", "analyse",
"evaluate", or "create". Here is a short explanation for the taxonomy,
together with verbs that are commonly used in questions that fit this taxonomy
level:
- remember: recall facts and basic concepts (verbs: define, duplicate, list,
  memorize, repeat, state)
- understand: ...
Note also that we could save on tokens by restricting the prompt to only the one cognitive dimension that a user
actually selects in the pipeline. In practice, the instructional part of the prompt is only a small part of the final prompt;
the bulk of which consists of the learning materials provided. As such, there is little need to be conservative here, and
listing all dimensions might help with clearly delineating them.
Next, we give a brief explanation of the STEX syntax used by the learning objects provided as context:
The learning objects will be given as a list of LaTeX snippets, using the sTeX
package for semantic markup. In sTeX, concepts are declared using the \symdecl
or \symdef commands, and can be referred to using ...
Indeed, we can keep this part brief since the learning objects will naturally contain large amounts of STEX syntax that
(ideally) allow the model to pick up on the relevant macros and their usage. In particular, we can safely omit example
usages here.
We also make sure that sufficient information is provided in the prompt to ensure the reuse of symbols present
in the learning objects and the surrounding modules:
Every learning object will be prefixed with an id that indicates which file it
is from, in a format close to what is required in \usemodule ...
Next, we explain precisely what kinds of questions we expect from the model, namely one of the three autograd-
able question types supported by STEX (multiple choice, single choice, or fill-in-the-blanks). We then provide examples for
each of them, so the model “knows” to replicate the relevant STEX macros and environments. We make sure that the
examples are representative of the output we expect from the model, i.e., they should contain as many detailed anno-
tations as possible, be (in our estimation) interesting and didactically valuable, and contain good feedback, especially
for the wrong answer options. Therefore, we also point out that adding feedback is desired:
An example for a multiple choice question is the following; note the semantic
markup:
```
\begin{sproblem}
\usemodule[smglom/sets]{mod?bijective}
\usemodule[smglom/sets]{mod?relation-composition}
\usemodule[smglom/arithmetics]{mod?natarith}
\objective{understand}{bijective}
\objective{understand}{injective}
\objective{understand}{surjective}
We delimit the STEX code example using three backticks (```), as is standard in, e.g., Markdown. In particular,
it is the same delimitation used by GPT (in particular ChatGPT) when producing code in its output.
Furthermore, we deliberately chose an example of a multiple choice question with more than one correct answer
– like (in our experience) human beings, GPT too seems to prefer having a single correct choice; in both cases likely
because it is easier to come up with questions where there is one correct answer. To direct the model to generate
more question types than just “single choice questions disguised as multiple choice”, we promote a variety of question
formats.
We subsequently provide similar examples for single choice and fill-in-the-blank questions. Finally, we reiterate
the expected outcomes and add specific criteria we want the questions to satisfy, starting with ones that seemed
important to us from the start:
Once you are given the data described above, you are to reply with the given
number of quiz questions as LaTeX code with semantic markup.
Make sure that the questions you generate satisfy the following criteria:
- The answer to a question should give a good indication regarding the extent
  to which the student has mastered the concept with respect to the given
  cognitive dimension.
- The questions should be as diverse as possible, i.e. they should test for
  different aspects of the concept.
- Make sure you provide good feedback, especially for wrong answers, so that
  students can learn what they did wrong.
- Assume the students know nothing about the concept other than what's in the
  learning objects provided.
- In single/multiple choice questions, make sure that the distractors are
  superficially plausible to students who have not yet mastered the subject,
  as to not make the problems too easy.
- Feel free to use examples the students might know from elsewhere, e.g. basic
  high school level knowledge, or foundational principles of the study program
  the course is in.
Finally, we list the parameters provided by the user as replacement variables that are substituted by the system:
%%%%
concepts: %%CONCEPT%%
course: %%COURSE%%
...
learning objects: %%LEARNING_OBJECTS%%
Occasionally, the output showed an excessive tendency to refer to the course materials directly, resulting in generated questions about, e.g., specific names of variables in some example somewhere in the course notes. Hence, we attempted to avoid questions in the output about irrelevant specifics of the course materials:
The model also sometimes added feedback or explanations to the text of an answer choice, inevitably revealing the correct and wrong answers in the question itself:
- Do not put any text in the LaTeX code that directly states which answer is
  correct - the sTeX macros used above take optional arguments explicitly for
  that purpose.
In some instances, quiz questions were generated to ask for additional free-text answers, so we emphasize that
students will not be able to do so:
- Note that students are limited to replying to a question in the form the
  question type is posed in, i.e. ticking boxes in single/multiple choice
  questions, or filling in a short text in fill-in-the-blanks questions.
  They can not provide any additional text.
The output rarely contained fill-in-the-blank questions. When we modified the prompt to explicitly encourage those, problems quickly emerged, which we attempted to fix thusly:
This seemed to help somewhat, but the primary noticeable result was even fewer fill-in-the-blank questions.
This should not be entirely surprising: beyond the remember dimension, fill-in-the-blank questions (where answers
are literally string matched to a reference solution) are primarily helpful to ask for an answer that (ideally) can only
result from, e.g., correctly applying an algorithm, computing some expression, counting specific properties, etc. – i.e.,
they are very useful to make questions asking for complex reasoning and processes autogradable. These questions tend
to be particularly difficult to design, though. Hence, the model strongly prefers multiple and single-choice questions.
In general, the generated questions for the apply dimension were not satisfactory. One strategy we tried to
improve this was to explicitly point out the use of fill-in-the-blank questions in the prompt, but this did not lead to
any noticeable improvement.
4.5 | Evaluation
In the literature, there are two established techniques to evaluate the quality of generated questions [24]: (1) auto-
matic evaluation and (2) human-based evaluation.
Certain criteria outlined in Section 3 necessitate specific insights from the course context, accessible only to
individuals within the lecture environment. Similarly, assessing the quality of STEX annotations requires expertise and
cannot be effectively automated. Hence, an expert-based evaluation was selected – the most common evaluation
approach in the context of automatic question generation [24].
For this purpose, we (1) developed a structured survey to evaluate the quality of generated questions based on
the criteria above, (2) generated a collection of 30 questions, (3) conducted the survey with experts within the course
environment, and (4) conducted a qualitative analysis of the survey findings.
The survey’s introductory segment supplied experts with context by outlining the parameters used in the question
generation prompt, followed by the complete presentation of the generated question (GQ), incorporating any semantic
annotations.
A free-text response field was incorporated to evaluate the accuracy of the content and detect any potential
errors in the GQ. This section allowed experts to pinpoint specific content discrepancies or inaccuracies within the
GQ. Additionally, we employed a 5-point Likert scale, spanning from “Very Difficult” to “Very Easy,” to assess the
perceived level of difficulty of the GQ.
The survey consisted of six statements to assess the generated question’s different quality dimensions. Experts
were tasked with indicating their degree of agreement with these statements on a 7-point Likert scale, ranging from
“Strongly Disagree” to “Strongly Agree”. The statements were outlined as follows:
Upon completing the survey, the experts could expand their assessments by noting any additional irregularities
or remarks. This section, intentionally left open-ended, was aimed at allowing the experts to share extra observations,
recommendations, or issues they might have about the tasks presented.
To evaluate GQs systematically across multiple topics, we decided on generating a total of 30 questions across six
distinct subjects within a university course “Artificial Intelligence I”, focusing on topics in symbolic AI with otherwise
fixed parameters in the prompt (see Section 4.3).
Difficulty and Cognitive Dimension The majority of task generation systems documented in the literature focus
solely on generating factual questions that target the remember dimension [39]. However, since our main objective
in AQG is to create learning materials capable of effectively enhancing our learner models, it is imperative to include
tasks that assess at least the cognitive level of understand. Thus, we have opted to confine our investigation in this
experiment to tasks that align with the understand dimension, assuming that the remaining dimensions are even
more challenging to design questions for in our context (as for humans, so for LLMs). We fixed the difficulty level to “medium” since we had already observed that the model largely generates relatively easy questions, which are consequently less informative when assessing students' mastery of a topic.
Number of Questions There is a trade-off regarding the number of questions generated per prompt at once: The
more questions generated, the more diverse they can be expected to be. In contrast, generating fewer per prompt
and prompting multiple times prevents immediate consideration of previously generated questions, making it difficult
to minimize overlap between questions. However, the model often refuses to generate too many questions at once; it only replies with a few and instructs the user to prompt again to obtain more. The maximal number of GQs per prompt
seems to depend on the topic or, presumably, the amount of course materials added to the prompt, but up to five
questions at once seems to work consistently. We leave a more systematic investigation of this trade-off and the
resulting quality of sets of questions as a whole to future work and fixed the number to be five GQs per topic (for a
total of 30 GQs).
Topics We chose six topics from the course spread roughly uniformly across the table of contents of the course
notes, attempting to make some “broader” and others more “specific”; these being:
5 | RESULTS AND DISCUSSION
The generated questions comprised single choice questions (12 out of 30) and multiple choice questions (18 out of 30). The former consistently had three possible answers, while the latter almost always offered four options, usually with three distractors. The constant number of answers may be because the examples in the prompt also contain precisely this number of answers. With the MCQs, explicit care was taken to ensure that the example
provided contained more than one correct answer. Nevertheless, the generated MCQs almost exclusively contained
exactly three distractors. Notably, no fill-in-the-blank (FIB) questions appear in the generated questions. This is likely
because designing meaningful FIB questions is significantly more challenging for the understand dimension, especially
considering the necessary constraint that the answer be unambiguous.
The fit of the GQs to the provided teaching material was confirmed by the experts in nearly all cases (28 out of 30), and most GQs were assessed as solvable using the available teaching material (27 out of 30). In most cases, the GQs were evaluated as clearly and unambiguously formulated. Finally, most experts rated the GQs as relevant for achieving the learning objective.
However, the quality of the provided feedback was mixed. Frequently, no feedback was generated for an answer
option. When it was, it was often not helpful in that it merely rephrased a wrong answer again – i.e., the feedback to
an incorrect choice X was a variant of “No, it is not the case that X ”. This instance of uninformative feedback has also
been observed and reported in other studies on LLM-feedback (e.g., [40]).
Eleven out of 30 GQs contained errors, mainly in the answer options and the feedback. These errors occurred particu-
larly frequently in the topics arc consistency and semantics of propositional logic. We can only speculate on the reasons
for this. Still, it seems likely that it correlates with the conceptual complexity of the topics or how well the topic is
represented in the model’s original training data – semantics requires a much better understanding of the meaning
of logical statements and the formal/mathematical mechanisms that provide statements with meaning (interpretation
functions, models and their formal definitions, etc.), and there are multiple ways to formulate these mathematically. Similarly, understanding arc consistency requires a solid intuition about how the various constraints in a constraint satisfaction problem interact with respect to the variables of a particular problem.
One crucial result to consider is that the model may produce superficially plausible questions that align with the
“aesthetic” of a good question but are false in not necessarily trivial ways.
One particularly striking example is the following question on the topic of propositional logic (generated in an
earlier experiment),
If we have already established ¬B, how can we use the Natural Deduction rule for implication (⇒) elimination to infer
a new formula?
☒ Given B ⇒ C, deduce ¬C
where the answer above was considered the intended and only correct one. This is the common fallacy of denying the
antecedent, but without understanding the semantics of propositional logic (or the natural deduction calculus) already,
this is not immediately obvious.
Notable here is that the question is not only wrong, but that it actively reinforces a common misconception that
may not necessarily be caught even by, e.g., student employees who assume they are sufficiently knowledgeable in
the domain to evaluate questions, which is arguably even more dangerous than a merely wrong question.
We can distinguish between two kinds of annotations in STEX: Structural annotations and relational annotations.
Structural annotations, such as \begin{sproblem}, \begin{mcb} (for MCQ-blocks),
\mcc[<T|F>,feedback={...}]{<answer>} (for an answer in a MCQ-block), \objective{<competency>}{<symbol>}
etc. should be present in all questions, or all questions of a particular question type, and are very schematic in the
sense that they are not particular to a specific question, topic or concept (other than the one explicitly provided as a
parameter to the prompt). These annotations seemingly pose no significant challenge to the model – the few examples
in the prompt are enough for the model to pick up on them immediately and use them consistently and correctly in
(almost) all generated questions. Rarely does it hallucinate plausible but non-existent symbol names in the objectives.
This is not the case for relational annotations, such as symbol or module references. To annotate a piece of text
with a specific symbol, one has to know the module it is declared in, import it (using \usemodule{...}), and then
annotate the text using the name of the symbol (e.g. “$a$ \sr{addition}{plus} $b$” annotates the word plus
with the symbol named “addition” which is currently in scope – globally, there may be multiple symbols with the same
name). Naturally, how to do such relational annotations can not be directly inferred from a few generic examples.
Instead, humans and AI models need a way to systematically look up symbols and their containing modules if we want
to be able to refer to them in annotations.
As mentioned in Section 4.1, we use these annotations to determine the prerequisites of learning objects, in-
cluding questions, so ideally, we would want the model to annotate as much as possible. Our initial hope that the
provided learning objects – which should have ample occurrences of the relevant symbol names and module imports
– would be enough for the model to “learn” to use them correctly turned out to be clearly false.4 Additional explicit
instructions in the prompt also did not improve matters.
4 One reason for that could be the unfortunately inconsistent usage of annotations in the course materials in general, improvements of which happen frequently but are far from exhaustive.
We therefore attempted to use the canonical way of giving models the ability to look up additional information:
function calling and retrieval-augmented generation. OpenAI's API for GPT-4 offers explicit support for providing the model with a set of functions and specifying which parameters they take. When prompted, the model will then
either return an output as usual, or a JSON object specifying (1) which function the model “wants” to call, and (2) the
parameters the call should use. It is then up to the caller of the API to implement functionality that augments the
following prompt with the correct return value for that function call.
We allowed the model to call a search function with a sequence of arbitrary strings as parameters. We augmented
the prompt by adding a paragraph that instructed the model to use this function to determine how to semantically
annotate any text that refers to a concept from the domain of the course. We then used existing functionality to
search for any definitions in our corpora and picked the top ten results for each of the string parameters. The search
results were subsequently appended to the prompt in a new delimited section, the instruction to call the function was
removed (to avoid potentially many subsequent calls or infinite loops), and the resulting prompt was fed back to the
model.
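The following Python sketch shows the shape of this retrieval loop. The tool name search_definitions, its parameter schema, the search_corpus helper, and the exact wording of the lookup instruction are hypothetical stand-ins; only the general tool-calling mechanism of the OpenAI API is taken as given, so this is a sketch rather than our exact implementation.
```
import json
from openai import OpenAI  # assumes the OpenAI Python SDK (>= 1.0)

client = OpenAI()

# Hypothetical declaration of the search function offered to the model.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_definitions",
        "description": "Look up definitions for the given concepts in the course corpus.",
        "parameters": {
            "type": "object",
            "properties": {"queries": {"type": "array", "items": {"type": "string"}}},
            "required": ["queries"],
        },
    },
}]

# Paraphrase of the paragraph added to the prompt; removed again before the
# second call to avoid repeated lookups or infinite loops.
LOOKUP_INSTRUCTION = ("Use the search_definitions function to determine how to "
                      "semantically annotate text referring to a course concept.")

def search_corpus(query: str, top_k: int = 10) -> list[str]:
    """Placeholder for the existing definition search over our corpora."""
    raise NotImplementedError

def generate_with_lookup(prompt: str) -> str:
    first = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
    )
    message = first.choices[0].message
    if not message.tool_calls:            # no lookup requested: return the output directly
        return message.content
    args = json.loads(message.tool_calls[0].function.arguments)
    results = [hit for q in args["queries"] for hit in search_corpus(q)]
    # Append the search results in a new delimited section and re-prompt once.
    followup = (prompt.replace(LOOKUP_INSTRUCTION, "")
                + "\n%%%% search results:\n" + "\n\n".join(results))
    second = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": followup}],
    )
    return second.choices[0].message.content
```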
This, too, unfortunately, did not have the desired effect; on the contrary. The quality of the questions generated
got noticeably worse without improving the quality of the semantic annotations. We conjecture that by appending
the search result and making the total prompt significantly longer, individual parts of the prompt have less influence
on the output, e.g. the instructions at the beginning of the prompt. It is possible that more experimentation, possibly
including repeating the instructions multiple times throughout the prompt, can lead to better results in the long run.5
5 According to anecdotal “LLM-folklore”, repeating important instructions in a long prompt is supposed to improve results.
5.4 | Limitations
We used a single model (GPT-4-turbo) to generate the questions for the evaluation. At the time of the study (March
2024), this model was – in our opinion – the most suitable model (see Section 4.2 for a detailed explanation). However,
we explicitly do not rule out the possibility that newer, “better” models will be more suitable and lead to better results. A further limitation of this study is that we did not involve students in evaluating the GQs. To
obtain objective evidence of the quality of the tasks, a survey of students on the tasks would be helpful and is planned
for the future. Especially when evaluating multiple choice questions, for example, an assessment by (only) experts is
not sufficient. Whether a distractor is good or bad usually only becomes apparent when the exercise is presented to
4 One reason for that could be the unfortunately inconsistent usage of annotations in the course materials in general, improvements of which
happen frequently but are far from exhaustive.
5 According to anecdotal “LLM-folklore”, repeating important instructions in a long prompt is supposed to improve results.
Lohr et al. 17
students (who try to solve it). If learners never choose the distractor, it is a good indicator that it is too apparent and,
therefore, unsuitable. At the time of evaluation, however, there was no possibility of recruiting suitable students from
the course. Additionally, most generated questions were not “ready to use” with students without first passing a “human filter”. Had we applied such a filter, we would no longer have evaluated the LLM's capabilities, but the experts' ability to filter the generated questions sensibly.
Furthermore, the study did not include any control items. As we did not aim to find out whether the experts were able to distinguish between LLM-generated and human-generated tasks, control items were of less importance. Their only function would have been to normalize the experts' ratings, which was not necessary for our research
questions.
6 | CONCLUSION
In this experiment, we investigated the extent to which large language models are suitable for generating, from semantically annotated course materials, semantically annotated quiz questions for higher education that are (1) suitable for a specific course and (2) address the understanding of a concept.
Our research reveals significant limitations in the application of LLMs for this task. While generating questions that
address remembering factual knowledge works well, creating questions that address understanding remains a notable challenge – such questions would need to target conceptual knowledge or ask for explanations rather than the recall of a simple concept. Despite LLMs' ability to generate a range of questions, domain experts' need for extensive filtering
underscores the models’ inadequacies in autonomously generating educationally valuable content. The quality of
questions generated by LLMs often does not meet educational standards, primarily due to the simplistic nature of
the questions and the lack of valuable (and correct) feedback. Thus, the required careful review process by experts is
both time-consuming and counter to the goal of automated content creation. In addition, errors in the content of the
questions are challenging to detect for non-experts, as semantically incorrect output from LLMs is known to be very
close to the “truth” [41].
Our results show that generating structural semantic annotations works well. However, this success was not
reflected in the case of relational annotations, which exhibited poor integration, indicating a limitation in the LLM’s
“ability” to contextualize and link concepts within the generated content effectively.
The results of the present work make clear that while LLMs can contribute to the pool of learning materials,
their current state requires significant human intervention to refine and validate the generated content. So, despite the promise of LLM-based systems, the human-in-the-loop remains crucial. Although the prompt was carefully designed and we provided a lot of static and dynamic context, the results still show weaknesses in quality and semantic annotation. Nevertheless, further studies with a larger number of generated tasks and more experts might show that automatic question generation following our approach can make the preparation process more efficient when used as an assistive system.
Acknowledgements
The work reported in this article was conducted as part of the VoLL-KI project (see https://voll-ki.de), funded by the German Research/Education Ministry under grant 16DHBKI089. We would also like to thank Rakesh Kumar for developing a significant portion of the front end for the question generator.
references
[1] Hattie JAC. Visible learning: a synthesis of over 800 meta-analyses relating to achievement. New York: Routledge;
2009.
[2] Kirby JR, Lawson MJ, editors. Enhancing the Quality of Learning: Dispositions, Instruction, and Learning Processes.
Cambridge: Cambridge University Press; 2012.
[3] Sweller J. Cognitive Load During Problem Solving: Effects on Learning. Cognitive Science 1988;12(2):257–285.
[4] Sweller J, van Merrienboer JG, Paas FWC. Cognitive Architecture and Instructional Design. Educational Psychology
Review 1998;10(3):251–296.
[6] Keller FS. "Good-bye, teacher...". Journal of applied behavior analysis 1968;1(1):79–89.
[7] Kruse T, Berges M, Betzendahl J, Kohlhase M, Lohr D, Müller D. Learning with ALeA: Tailored Experiences through Annotated Course Material; 2023.
[8] Wolfe JH. Automatic Question Generation from Text - an Aid to Independent Study. In: Proceedings of the ACM SIGCSE-SIGCUE Technical Symposium on Computer Science and Education. ACM Press; 1976. p. 104–112.
[9] Denny P, Khosravi H, Hellas A, Leinonen J, Sarsa S. Can We Trust AI-Generated Educational Content? Comparative
Analysis of Human and AI-Generated Learning Resources. eprint 2023;.
[10] Song T, Tian Q, Xiao Y, Liu S. Automatic Generation of Multiple-Choice Questions for CS0 and CS1 Curricula Using
Large Language Models. In: Hong W, Kanaparan G, editors. Computer Science and Education. Computer Science and
Technology, vol. 2023 Singapore: Springer Nature Singapore; 2024.p. 314–324.
[11] Sarsa S, Denny P, Hellas A, Leinonen J. Automatic Generation of Programming Exercises and Code Explanations Using
Large Language Models. In: Proceedings of the 2022 ACM Conference on International Computing Education Research
- Volume 1 Lugano and Virtual Event Switzerland: ACM; 2022. p. 27–43.
[12] Kurdi G, Leo J, Parsia B, Sattler U, Al-Emari S. A Systematic Review of Automatic Question Generation for Educational
Purposes. International Journal of Artificial Intelligence in Education 2020 Mar;30(1):121–204.
[13] Prather J, Denny P, Leinonen J, Becker BA, Albluwi I, Craig M, et al. The Robots Are Here: Navigating the Generative AI
Revolution in Computing Education. In: Proceedings of the 2023 Working Group Reports on Innovation and Technology
in Computer Science Education Turku Finland: ACM; 2023. p. 108–159.
[14] Bhat S, Nguyen H, Moore S, Stamper J, Sakr M, Nyberg E. Towards Automated Generation and Evaluation of Questions
in Educational Domains. In: Proceedings of the 15th International Conference on Educational Data Mining International
Educational Data Mining Society; 2022. p. 701–704.
[15] Wang Z, Valdez J, Basu Mallick D, Baraniuk RG. Towards Human-Like Educational Question Generation with Large
Language Models. In: Rodrigo MM, Matsuda N, Cristea AI, Dimitrova V, editors. Artificial Intelligence in Education, vol.
13355 Cham: Springer International Publishing; 2022.p. 153–166.
[16] Wong LH, Park H, Looi CK. From Hype to Insight: Exploring ChatGPT's Early Footprint in Education via Altmetrics and Bibliometrics. Journal of Computer Assisted Learning 2024 Feb; jcal.12962.
[17] Yan L, Sha L, Zhao L, Li Y, Martinez-Maldonado R, Chen G, et al. Practical and Ethical Challenges of Large Language
Models in Education: A Systematic Scoping Review. British Journal of Educational Technology 2024 Jan;55(1):90–112.
[18] McNichols H, Feng W, Lee J, Scarlatos A, Smith D, Woodhead S, et al. Exploring Automated Distractor and Feedback
Generation for Math Multiple-choice Questions via In-context Learning. eprint 2023;.
[19] Dijkstra R, Genç Z, Kayal S, Kamps J. Reading Comprehension Quiz Generation Using Generative Pre-trained Transform-
ers. In: Proceedings of the Fourth International Workshop on Intelligent Textbooks 2022, vol. 4 Durham, UK: Aachen:
CEUR-WS; 2022. p. 4–17.
[20] Tran A, Angelikas K, Rama E, Okechukwu C, Smith DH, MacNeil S. Generating Multiple Choice Questions for Computing
Courses Using Large Language Models. In: 2023 IEEE Frontiers in Education Conference (FIE) College Station, TX, USA:
IEEE; 2023. p. 1–8.
[21] Anderson LW, Krathwohl DR, editors. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxon-
omy of Educational Objectives. Complete ed ed. New York: Longman; 2001.
[22] Lohr D, Berges M, Kohlhase M, Müller D, Rapp M. The Y-Model - Formalization of Computer Science Tasks in the
Context of Adaptive Learning Systems. In: 2023 IEEE 2nd German Education Conference (GECon) Berlin, Germany:
IEEE; 2023. p. 1–6.
[23] Ch DR, Saha SK. Automatic Multiple Choice Question Generation From Text: A Survey. IEEE Transactions on Learning
Technologies 2020 Jan;13(1):14–25.
[24] Mulla N, Gharpure P. Automatic Question Generation: A Review of Methodologies, Datasets, Evaluation Metrics, and
Applications. Progress in Artificial Intelligence 2023 Mar;12(1):1–32.
[25] Hattie J, Timperley H. The Power of Feedback. Review of Educational Research 2007 Mar;77(1):81–112.
[26] Kiesler N, Lohr D, Keuning H. Exploring the Potential of Large Language Models to Generate Formative Programming
Feedback. In: 2023 IEEE Frontiers in Education Conference (FIE) College Station, TX, USA: IEEE; 2023. p. 1–5.
[27] Lohr D, Kiesler N, Keuning H, Jeuring J. “Let Them Try to Figure It Out First” - Reasons Why Experts (Do Not) Pro-
vide Feedback to Novice Programmers. In: Proceedings of the 2024 Innovation and Technology in Computer Science
Education (ITiCSE 2024), vol. 1 Milan, Italy: ACM; 2024. p. 7.
[28] Hellas A, Leinonen J, Sarsa S, Koutcheme C, Kujanpää L, Sorva J. Exploring the Responses of Large Language Models
to Beginner Programmers’ Help Requests. In: Proceedings of the 2023 ACM Conference on International Computing
Education Research V.1 Chicago IL USA: ACM; 2023. p. 93–105.
[29] Balse R, Valaboju B, Singhal S, Warriem JM, Prasad P. Investigating the Potential of GPT-3 in Providing Feedback for
Programming Assessments. In: Proceedings of the 2023 Conference on Innovation and Technology in Computer Science
Education V. 1 Turku Finland: ACM; 2023. p. 292–298.
[30] Matelsky JK, Parodi F, Liu T, Lange RD, Kording KP. A Large Language Model-Assisted Education Tool to Provide Feed-
back on Open-Ended Responses. eprint 2023;.
[31] Kohlhase M, Müller D. System Description: sTeX3 – A LATEX-based Ecosystem for Semantic/Active Mathematical Docu-
ments. In: Buzzard K, Kutsia T, editors. Intelligent Computer Mathematics (CICM) 2022, vol. 13467 of LNAI Springer;
2022. p. 184–188. https://kwarc.info/people/dmueller/pubs/cicm22stexsd.pdf.
[32] Kohlhase M. OMDoc – An open markup format for mathematical documents [Version 1.2]. No. 4180 in LNAI, Springer
Verlag; 2006. http://omdoc.org/pubs/omdoc1.2.pdf.
[34] Müller D. An HTML/CSS schema for TEX primitives – generating high-quality responsive HTML from generic TEX. In:
TUGboat; TUG 2023 Conference Proceedings, vol. 44; 2023. p. 275–286. https://kwarc.info/people/dmueller/pubs/
tug23.pdf.
[36] Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al., LLaMA: Open and Efficient Foundation Language
Models; 2023.
[37] Zamfirescu-Pereira JD, Wong RY, Hartmann B, Yang Q. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail)
to Design LLM Prompts. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems CHI
’23, New York, NY, USA: Association for Computing Machinery; 2023. https://doi.org/10.1145/3544548.3581388.
[39] Kumar AP, Nayak A, K MS, Chaitanya, Ghosh K. A Novel Framework for the Generation of Multiple Choice Question
Stems Using Semantic and Machine-Learning Techniques. International Journal of Artificial Intelligence in Education
2023 Mar;.
[40] Lohr D, Keuning H, Kiesler N. You’re (Not) My Type – Can LLMs Generate Feedback of Specific Types for Introductory
Programming Tasks? Journal of Computer Assisted Learning (JCAL) 2024;Accepted.
[41] Sobieszek A, Price T. Playing Games with Ais: The Limits of GPT-3 and Similar Large Language Models. Minds and
Machines 2022 Jun;32(2):341–364.