0% found this document useful (0 votes)

20 views20 pages

Leveraging Large Language Models To Generate Course-Specific Semantically Annotated Learning Objects

This paper investigates the use of large language models (LLMs) for generating semantically annotated quiz questions in computer science education. The study highlights the potential of LLMs to automate the creation of educational content while addressing the challenges of ensuring quality and relevance to specific courses. Results indicate that while LLMs can produce structural annotations effectively, they often fall short in generating high-quality educational questions without significant human intervention.

Uploaded by

mahnoorarshad311002

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

20 views20 pages

Leveraging Large Language Models To Generate Course-Specific Semantically Annotated Learning Objects

Uploaded by

mahnoorarshad311002

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

ORIGINAL ARTICLE

Journal Section

Leveraging Large Language Models to Generate

arXiv:2412.04185v1 [cs.AI] 5 Dec 2024

Course-speciﬁc Semantically Annotated Learning

Objects

Dominic Lohr1† | Marc Berges1† | Abhishek Chugh3 |

Michael Kohlhase2† | Dennis Müller2†

1 Professorship for Computer Science

Education Background: Over the past few decades, the process and
2 Professorship
for Knowledge methodology of automated question generation (AQG) have
Representation and Management
undergone significant transformations. Recent progress in
3 sophize.org
generative natural language models has opened up new po-
Correspondence tential in the generation of educational content.
Dominic Lohr, Professorship for Computer
Science Education,
Objectives: This paper explores the potential of large lan-
Friedrich-Alexander-Universität guage models (LLMs) for generating computer science ques-
Erlangen-Nürnberg, 91058, Germany
Email: dominic.lohr@fau.de
tions that are sufficiently annotated for automatic learner
model updates, are fully situated in the context of a particu-
Present address
† Department of Computer Science, lar course, and address the cognitive dimension understand.
Martensst. 3, Methods: Unlike previous attempts that might use basic
Friedrich-Alexander-Universität
Erlangen-Nürnberg, Germany
methods like ChatGPT, our approach involves more targeted
strategies such as retrieval-augmented generation (RAG) to
Funding information
Federal Ministry of Education and Research
produce contextually relevant and pedagogically meaning-
Germany, Grant: 16DHBKI089 ful learning objects.
Results and Conclusions: Our results show that generat-
ing structural, semantic annotations works well. However,
this success was not reflected in the case of relational an-
notations. The quality of the generated questions often did
not meet educational standards, highlighting that although
LLMs can contribute to the pool of learning materials, their
current level of performance requires significant human in-

1
2 Lohr et al.

tervention to reﬁne and validate the generated content.

KEYWORDS
automated question generation, computer science, generative AI,
GPT-4, large language models, retrieval-augmented generation

1 | INTRODUCTION

The determinants of learning success have been extensively researched in various disciplines [1, 2]. Numerous em-
pirical studies support the hypothesis that tailoring learning materials to learner’s needs significantly increases the
effectiveness of learning outcomes [3, 4]. The mastery learning theory by Bloom [5] or the personalized system of
instruction theory by Keller [6] postulate teaching methodologies that highly build on individualized learning materials
and assessment on a very fine-grained level. However, resource limitations and an increasingly diverse educational
landscape challenge teachers and content creators in general.
Adaptive learning systems like ALEA [7] promise to address this need by delivering Learning Objects (LOs), such
as definitions, examples, or questions tailored to individual learners’ specific prior knowledge, competencies, and
preferences. One way of realizing this is by using semantically annotated LOs that allow for determining their pre-
requisites, which concepts they address, and which competencies they intend to foster. In conjunction with a model
of a learner’s current knowledge and competencies, this allows for many valuable services, such as the automatic
generation of flashcards or so-called guided tours: individually selected sequences of LOs (“learning paths”) that can
be automatically created by learners on demand by a simple click, to open up a specific concept in a specific cognitive
dimension. To make this possible, the pool of semantically annotated LOs must be just as diverse as the educational
biographies of the learners. However, developing sufficiently annotated LOs requires huge amounts of manual labor
and expertise. This bottleneck underlines the need for innovative solutions that can generate customized learning ma-
terials automatically, thus becoming a focus of current research in computer science education. Attempts to (partially)
automate the creation of quiz questions can be found in the literature as early as the 1970s [8].
Studies on the role of artificial intelligence in education point out that a significant hurdle is the deficiency of
suitable learning materials for individualized and adaptable learning [9], a problem that may be mitigated by the ca-
pabilities of large language models (LLMs). Research in computing education is exploring the capabilities of these
models to generate educational content that is both contextually appropriate and educationally demanding, offering
a promising solution to the constraints associated with manually creating content. However, the effectiveness of
LLM-generated educational content is often limited by their lack of focus on specific course content and the learner’s
current understanding. The generated LOs often do not fit within the intended educational framework or meet the
diverse needs of individual learners.
This paper presents the results of experiments on generating semantically annotated quiz questions using a state-
of-the-art LLM. In particular, we investigate the question of to what extent LLMs can be used to generate questions in
the domain of university-level computer science (CS) that are didactically valuable, are sufficiently annotated to allow
for the above selection process, can ideally be graded by a software system (e.g. multiple choice or fill-in-the-blanks
questions) to automatically and immediately increase the accuracy of the system’s learner models, and are entirely
situated in the context of a particular university-level course concerning terminology and notations used. Unlike
earlier similar studies [10, 11], our task requires extensive context, rendering it unsuitable for naive approaches, e.g.,
using ChatGPT. Instead, we use more targeted techniques beyond “prompt engineering”, such as retrieval-augmented
generation (RAG).
Lohr et al. 3

Overview Section 4 details our experimental methodology, including the considerations behind our choice of seman-
tic annotations (Section 4.1), the selection criteria for the LLM (Section 4.2), the design of our question generation
pipeline (Section 4.3) and the evaluation framework used, including the criteria for assessing question quality and an
expert-based survey methodology. The results of our experiments are presented and discussed in Section 5. Finally,
Section 6 concludes the paper with a summary of our ﬁndings, implications for future research in automated ques-
tion generation using LLMs, and potential pathways for enhancing the educational value of LLM-generated content
in adaptive learning environments.

2 | RELATED WORK

Over the past few decades, the process and methodology of automated question generation (AQG) have under-
gone significant transformations, driven primarily by advancements in computational linguistics and artificial intelli-
gence (AI). In the earlier stages, AQG relied heavily on rule-based systems that applied predefined templates and
linguistic patterns to generate questions from text, requiring extensive manual crafting and domain-specific adjust-
ments. First studies can be found by Wolfe [8].
With the development of deep learning and neural network models, a transition to more sophisticated, context-
aware systems can be recognized. These models, particularly sequence-to-sequence and transformer-based archi-
tectures, have enabled the generation of more nuanced, relevant, and diverse questions by “understanding” deeper
semantic relationships within the text. Kurdi et al. conducted a systematic review of empirical research focused on
addressing the issue of AQG within educational contexts [12]. They thoroughly outlined various methods of genera-
tion, tasks, and evaluation techniques found in studies between 2015 and early 2019. A standard method involves
identifying sentences in the text sources with high information content using topic words or key phrases. The system
then selects a keyword as the answer key, removes it to form a question stem, and generates incorrect choices (dis-
tractors) using a clustering method without external data. These systems mainly produce questions testing remember
factual knowledge, not understanding, a key focus of the work presented in this paper, and the results show that large
language models (LLMs) can potentially benefit from semantics-based approaches to generate meaningful questions
that are closely related to the source content.
Recent progress on generative natural language models has opened up new potentials in the generation of edu-
cational content [13, 14, 15, 16]. In recent years, more and more approaches have been found to generate tasks in a
single step using LLMs like GPT-3 instead of dividing the AQG process into several sub-tasks. Yan et al. did a system-
atic scoping review of articles published since 2017 to pinpoint the current state of research on using LLMs [17]. They
identified content generation, including multiple-choice questions and feedback generation, as primary educational
tasks that research aims to automate. McNichols et al. tested the effectiveness of LLM-based methods for automatic
distractors and feedback generation using a real-world dataset of authentic student responses [18]. Their findings
show that fine-tuning was ineffective and that other approaches than LLM-prompting need to be explored. Dijkstra
et al. [19] developed a quiz generator based on OpenAI’s GPT-3 model, fine-tuned on text-quiz pairs to generate
complete multiple-choice questions, with correct answers and distractors. They noted that while most of the gen-
erated questions were of acceptable quality, creating high-quality distractors was more challenging than generating
question-and-answer pairs.
More specifically, LLM-based approaches have recently been applied in programming education (for an intensive
literature review, see [13]). Sarsa et al. [11] explored the capabilities of OpenAI’s LLM Codex to generate program-
ming exercises and code explanations. Their results show that the generated questions and explanations were novel,
4 Lohr et al.

sensible, and, in some cases, ready to use. Tran et al. [20] evaluated the capabilities of OpenAI’s GPT-4 and GPT-3
models to generate isomorphic multiple choice questions (MCQs) based on MCQ-stems from a question bank and
an introductory computing course. Their ﬁndings underscore that the newer generation of LLMs outperforms older
generative models in AQG.

Research Gap
Reviewing the current landscape of research in AQG, several critical gaps emerge that necessitate further investiga-
tion. Current research shows that methods for generating closed-format tasks that target the cognitive dimension of
remember factual knowledge work well. However, generating tasks that target deeper understanding, such as those
that require comprehension or analytical skills, remains a significant challenge. We also could not find any research
on generating questions containing semantic annotations. While it is already known that LLMs have the potential for
generating feedback, and in particular code explanations, no work attempts to generate questions using LLMs that go
beyond introductory courses in computing education. Kurdi et al.[12] point out that there is little focus on controlling
question parameters like difficulty and generating feedback. Since we want to incorporate the generated questions
into an adaptive learning assistant, we have carefully considered which parameters are relevant for adaptive learning
– such as difficulty, cognitive dimension, and prior knowledge – and want to explore how these can be integrated
into the question generation process. We also did not find any studies that investigated generating questions using
advanced LLM approaches such as RAG, as it is crucial for scalability beyond the original training data of the model.

3 | REQUIREMENTS FOR THE GENERATED QUESTIONS

We posit that automated question generation (AQG) in the context of university courses poses additional challenges
barely (or not at all) covered by the existing literature, especially in a domain like computer science:
Firstly, we claim that in more abstract domains like mathematics and math-related subfields of computer science,
questions solely focusing on remembering facts, or rudimentary application exercises are less suitable for the out-
comes of a university-level class. This makes designing appropriate questions a more challenging task because doing
so requires a level of understanding of and experience with the learning material.
Secondly, although the topics covered in a particular course are common across different universities, the details
are much less standardized (e.g., precise definitions, terminology, and notational conventions). This entails the addi-
tional requirement that questions (whether automatically generated or not) need to be formulated and situated in line
with the conventions in a particular course, which are also more frequently subject to change. This makes approaches
based on large amounts of dedicated training data, which necessarily do not generalize beyond a specific course and
instructor’s preferences, unsuitable. Instead, we should be able to provide the generator with the relevant learning
materials (course notes, slides, etc.) whenever we want to generate a new batch of questions.

Additionally, our goal is to utilize the generated questions in an adaptive learning assistant that is capable of auto-
matically selecting suitable questions based on the associated learning objective and a particular student’s (estimated)
prior knowledge in the form of a learner model. Therefore, we need to annotate the questions with the relevant
information for this selection process, namely:

1. The concepts the question is intended to test,

2. the cognitive level the question targets (modeled as levels in Bloom’s revised taxonomy [21]), and
3. the prerequisite concepts occurring in the question and the associated competencies that a user should have
mastered for the question to be suitable.
Lohr et al. 5

FIGURE 1 Source Code and Representation in the system

While this is less difficult for an experienced user, providing these annotations is still time-consuming and potentially
automatable. Therefore, we also investigate the extent to which a large language model can directly generate fully
annotated questions.
Figure 1 shows an example of a fully annotated multiple-choice question created by hand in LATEX following the
Y-model framework [22]. The relational semantic annotations allow the learner to, e.g., hover over concepts in the
text (after conversion to HTML) and display definitions (this functionality can, of course, be deactivated if desired). A
detailed description of the specific annotation schema can be found in Section 4.1.
Ensuring the quality of the generated tasks is a primary goal. As there is no standard metric in the literature
for measuring the quality of questions [23], we reviewed metric lists from literature reviews on automatic question
generation [24, 23] to identify relevant criteria that determine the quality of the questions we generate, namely:

• correct technical language,

• appropriateness for a particular course context, including deﬁnitions, annotations, and conventions,
• feasibility of solving the question with the provided details and available course materials,
• clarity and lack of ambiguity,
• relevance to achieving the intended learning outcomes,
• alignment with the speciﬁc format of the task, and
• automatically gradable (closed format).

Additionally, feedback is one of the most inﬂuencing factors for learning success [25], and recent experiments
show the potential of large language models for generating feedback in programming education [26, 27, 28, 29, 30].
A further goal is to generate questions containing feedback that helps learners understand why a given answer is
wrong.
The research questions are as follows:
(RQ1) To what extent can LLMs be used to generate university-suitable autogradable questions in CS education?
(RQ2) To what extent can LLMs be used to annotate these questions semantically?
6 Lohr et al.

4 | METHODOLOGY

4.1 | Semantic Annotations in STEX

To semantically annotate learning objects – including questions – we use the STEX package [31] for LATEX. STEX uses
an ontology based on OMDoc [32]: Concepts are represented as symbols that can be introduced via the \symdecl-
macro and can be related to each other in various ways. Symbols are always declared in modules (via the smodule-
environment), which can import (the symbols of) other modules, adding another layer of relations and allowing for
sharing concepts among large collections of disparate documents, thus enabling the collaborative and modular devel-
opment of domains of knowledge as highly interrelated knowledge graphs independent of, and across, presentational
context: symbols can have arbitrarily many definitions, formulations thereof, and different notations, to accommodate
author’s preferences without duplicating the domain knowledge itself.
Symbols can be referenced in various ways; most importantly, the \symref-macro allows annotating arbitrary
text as representing a particular symbol, and formal notations can be produced via dedicated semantic macros that
additionally associate the notation with the corresponding symbol. Since the details are largely irrelevant to the topic
of this paper, we refer to [33] for details and Section 4.4 for examples.
More importantly, for our purposes, STEX also bundles the problem-package, providing dedicated markup macros
and environments for various variants of quiz questions. Most importantly, it provides the sproblem-environment
to annotate questions, within which we can use the mcb and scb environments for multiple and single choice blocks,
respectively. As a third (autogradable) question type, the \fillinsol macro can be used to generate a blank box for
fill-in-the-blanks questions. In all three cases, we can mark (or provide) the correct answer(s), add feedback text to
be displayed if a particular answer (correct or wrong) is chosen, and specify grading actions (i.e., set, add or deduct
points) depending on the answers given. Additionally, we can specify the learning objectives and prerequisites as pairs
of a symbol and one of the keywords remember, understand, apply, analyze, evaluate, create. Symbols refer-
enced in the body of the question are automatically determined to be prerequisites with the cognitive dimension
remember. For example, if a module declares the symbol plus for addition on integers, then the question “what is 2
\symref{plus} {added to} 2?” automatically has the prerequisite (remember,plus). We refer to such annotations as
relational annotations and to those that do not relate text to some symbol (e.g. fillinsol or sproblem) as structural
annotations.
Besides being LATEX compilable to pdf, we can use the RUSTEX system [34] to convert the STEX documents to HTML,
preserving the semantic annotations in the form of attributes on the HTML nodes. Additionally, this assigns a globally
unique URI to the document itself and every section, module, symbol, and learning object therein, enabling searching,
querying, and referencing all of these afterward across documents.
Subsequently, we can embed various services that act on the semantic annotations directly into the HTML docu-
ments via JavaScript. Our learning assistant utilizes this to render quiz questions as interactive components, evaluate
answers provided by students, display the appropriate feedback provided, and update a student’s learner model ac-
cordingly (see Figure 1).
While our usage of STEX is primarily motivated by our learning assistant, for the purposes of this paper, it offers
additional advantages: Its underlying ontology, being designed around representing the semantics of mathematical
statements formally, is consequently sufficiently expressive to subsume most semantic annotation systems, in partic-
ular those tailored specifically for quiz questions. Our results should, therefore, (all else being equal) generalize to
other annotation schemes. Furthermore, since STEX inherits its syntax from LATEX, the LLM used does not need to be
finetuned or explicitly prompted on the usage of a new or esoteric language – the huge amount of publicly available
Lohr et al. 7

LATEX code implies that its syntax is well represented in the training data used for these models, and can therefore be
expected not to pose an issue for generation. Indeed, our experiments did not yield any LLM output with basic syntax
errors.

4.2 | Model Selection

Our goals put notable constraints on the methodology for generating questions:

1. Targeting understanding rather than remembering factual knowledge implies that the model we use should be
capable of “synthesizing” complex knowledge, excluding, e.g., smaller language models less capable of “reasoning”.
2. Since the knowledge domain and preferred conventions should not be ﬁxed, deliberately training a model on
dedicated course materials is largely not feasible, which calls for few-shot approaches. And
3. since the latter point also entails providing possibly large amounts of course materials online during generation,
the model used needs to be able to process large amounts of text at once. More precisely, it should allow for a
large enough context window – the maximum number of tokens the model can consider in a single prompt/reply
step.

OpenAI’s GPT-4 [35] shows consistently better results across all tasks compared to alternative large language
models1 . In particular, we initially experimented with the free GPT-3.5 model (that powers the free version of Chat-
GPT), which quickly demonstrated very poor performance both concerning the questions generated and, more broadly,
the ability to follow the instructions provided – meaning more prompt design was unlikely to improve the results sig-
niﬁcantly.
Similarly, most models (including almost all open-source models) primarily focus on short conversations and, there-
fore, only support a relatively small context window. For example, one of the currently most popular open source lan-
guage models, LLaMA [36], can process a total input of 2048 tokens, whereas GPT-3.5 and GPT-4-Turbo have context
windows of up to 16,385 and 128,000 tokens, respectively.2
We consequently opted for the (unfortunately commercial, closed source, and proprietary) GTP-4-Turbo model
as the one holding the most promise with respect to our criteria at the time of the experiments. Unfortunately, this
also means we can not provide access to a public instance of our pipeline since the API used to access the model is
monitored and billed on a per-token basis.3

4.3 | Overview of the Generation Pipeline

To allow for the model to generate questions for a specific course regarding terminology, definitions, and notations, we
opted for a technique called retrieval-augmented generation (RAG). This technique gives models access to additional
information beyond their training data by querying an external knowledge base (such as a database or a web search
engine) for results relevant to the specific prompt. It concatenates them to the prompt itself before passing it into the
model.

1 See, e.g., https://toloka.ai/blog/llm-leaderboard/ for a regularly updated comparison across several dimensions
2 It should be mentioned, that the precise meaning of “context window” is not necessarily clear. There is some speculation that the huge
context window advertised for GPT-4-Turbo involves neural compression and similar (potentially lossy) techniques to allow for more input
rather than representing the actual size of the underlying transformer model’s input layer.
3 A common problem is people actively crawling the internet for API keys and publicly accessible APIs to GPT models, intending to abuse those
for unintended purposes (e.g., via “prompt hacking”).
8 Lohr et al.

To explain how we use RAG, we give a brief overview of the entire question-generation pipeline. Note that we
deliberately leave room for variation:

• A course instructor selects some concept (represented by an STEX symbol), a cognitive dimension, a document of
course materials (i.e., the entire lecture notes) via its URI, and additional parameters (e.g., number of questions,
difficulty level, types of questions, a brief description of the course topics, ...).
• The system replaces placeholders in a generic “master prompt” (see Section 4.4) by the parameters provided.
• The system then selects those text fragments in the provided course materials that directly relate to the chosen
concept – e.g., definitions, examples, surrounding remarks, etc. – on the basis of the semantic annotations in
the document and appends them to the master prompt. Since STEX symbols are intrinsically relational, we can, in
principle, also add materials for dependent concepts down to some cutoff point. In practice, we currently pick the
entire chapter (or section or subsection) in which the concept is introduced.
• The final prompt is passed to the LLM and its output is presented to the instructor.

We focus here on the scenario where all parameters – in particular, the concept, cognitive dimension, and number
of questions – are explicitly ﬁxed in the prompt, as this allows for better evaluation of the results of our experiments.
However, we note that we can easily generalize to broader applications by automatically selecting some of the param-
eters. For example, we can generate questions for a whole chapter along all cognitive dimensions by determining the
concepts introduced therein (which we can do automatically based on semantic annotations) and repeatedly running
the above pipeline with, e.g., randomly varied parameters.

We also chose not to split the task into multiple prompts for specific purposes. For example, we could conceptually
use three distinct prompts (or even distinct models) for 1. generating questions, 2. adding feedback for students, and
3. introducing semantic annotations. This split would be natural in a non-LLM-based approach since all three steps
require distinct methodologies and tools. However, when using the same model for all three steps, it is less clear
whether splitting the task is advantageous. We leave the question of whether it is for future work but note that our
initial (and admittedly superficial) experiments in that direction did not yield noticeably different results. Furthermore,
experiments with instructing the model to (paraphrased) first “think of a good question, summarize it, and then state
the question as fully annotated STEX” – conceptually splitting the task into two separate steps – did not seem to make
a clear difference in the quality of the output either. This is to some extent unexpected since it closely corresponds
to the popular “think step by step”-instruction that is considered to generally improve results in LLM prompts related
to reasoning.

4.4 | Prompt Design

Despite ongoing attempts to declare “prompt engineering” a marketable skill, few hard principles for designing LLM
prompts can be consistently well supported empirically. The most important rules of thumb can be brieﬂy summarized
thusly [37, 38]:

1. Be speciﬁc in what the output should look like,

2. be detailed by including any and all information and requirements relevant to the expected output,
3. provide examples in the prompt, eﬀectively transforming the task from a zero-shot to a few-shot approach, and,
most importantly,
4. iterate by repeatedly testing, observing problems with the outputs, and modifying the prompt to discourage these
problems in subsequent iterations.
Lohr et al. 9

Paraphrasing Zamfirescu-Pereira et al. [37], it helps thinking of the language model as an interpreter and prompting as
analogous to programming in the sense that it requires clear and detailed instructions to an entity keen to “do what
you say, not what you (clearly) mean”.
Consequently, our final prompt is in many places the result of flaws occurring in previous iterations, which we
will note where appropriate.

Our prompt starts with a brief summary of what we expect the model to do:

Your task is to generate quiz questions that evaluate the competency of

university students with respect to a number of particular concepts in a
particular university course . You will be given
- the names of the concepts ,
- the name of the course ,
- a cognitive dimension to test for ,
- ...

We then clarify what we mean by cognitive dimension. Note that the huge amounts of data the model was trained on
also contain many texts about concepts in education; particularly Bloom’s revised taxonomy [21]. Nevertheless, we
prefer to be precise:

We use the term " cognitive dimension " according to Bloom ’ s taxonomy . It is
provided as one of the strings " remember " , " understand " , " apply " , " analyse " ,
" evaluate " , or " create ". Here is a short explanation for the taxonomy ,
together with verbs that are commonly used in questions that fit this taxonomy
level :
- remember : recall facts and basic concepts ( verbs : define , duplicate , list ,
memorize , repeat , state )
- understand : ...

Note also, that we could save on tokens by restricting the prompt to only the one cognitive dimension that a user
actually selects in the pipeline. In practice, the instructional part of the prompt is only a small part of the ﬁnal prompt;
the bulk of which consists of the learning materials provided. As such, there is little need to be conservative here, and
listing all dimensions might help with clearly delineating them.
Next, we give a brief explanation of the STEX syntax used by the learning objects provided as context:

The learning objects will be given as a list of LaTeX snippets , using the sTeX
package for semantic markup . In sTeX , concepts are declared using the \ symdecl
or \ symdef commands , and can be referred to using ...

Indeed, we can keep this part brief since the learning objects will naturally contain large amounts of STEX syntax that
(ideally) allow the model to pick up on the relevant macros and their usage. In particular, we can safely omit example
usages here.
We also make sure that suﬃcient information is provided in the prompt to ensure the reuse of symbols present
in the learning objects and the surrounding modules:

Every learning object will be prefixed with an id that indicates which file it
is from , in a format close to what is required in \ usemodule ...
10 Lohr et al.

Next, we explain precisely what kinds of questions we expect from the model, namely one of the three autograd-
able question types supported by STEX (multiple choice, single choice, or ﬁll-in-the-blanks). We then provide examples for
each of them, so the model “knows” to replicate the relevant STEX macros and environments. We make sure that the
examples are representative of the output we expect from the model, i.e., they should contain as many detailed anno-
tations as possible, be (in our estimation) interesting and didactically valuable, and contain good feedback, especially
for the wrong answer options. Therefore, we also point out that adding feedback is desired:

A quiz question can be either a multiple choice question , a single choice

question or a fill - in - the - blanks question . We can also provide feedback for
each answer option using the ‘ feedback ‘ key of the relevant macros . This
feedback is shown to the student after they have submitted their answer .

An example for a multiple choice question is the following ; note the semantic
markup :
‘‘‘
\ begin { sproblem }
\ usemodule [ smglom / sets ]{ mod ? bijective }
\ usemodule [ smglom / sets ]{ mod ? relation - composition }
\ usemodule [ smglom / arithmetics ]{ mod ? natarith }
\ objective { understand }{ bijective }
\ objective { understand }{ injective }
\ objective { understand }{ surjective }

Assume $ \ fun {f , g }\ Natura lNumbe rs \ Na t ur al N um b er s$ . Which of the following are

true ?
\ begin { mcb }
\ mcc [F , feedback ={ No , $f$ and $g$ are unrelated }]
{ If $f$ is \ sn { injective } , so is $g$ .}
\ mcc [F , feedback ={ No . since $f$ does not need to be \ sn { surjective } , the
\ sr { surjective }{ surjectivity } of $g$ is not enough to make the
\ sr { compose }{ composition } of $f$ and $g$ \ sn { surjective }.}]
{ If $f$ is \ sn { injective } and $g$ is \ sn { surjective } , then
$ \ compose {g , f } $ is \ sn { surjective }.}
\ mcc [F , feedback ={ No . Since $f$ need not be \ sn { surjective } , the
\ sn { composition } need not be \ sn { surjective } either .}]
{ If $f$ is \ sn { injective } and $g$ is \ sn { surjective } , then
$ \ compose {g , f } $ is \ sn { bijective }.}
\ mcc [ T ]{ If $f$ and $g$ are \ sn { injective } , so is $ \ compose {g , f } $ .}
\ mcc [ T ]{ If $f$ and $g$ are \ sn { surjective } , so is $ \ compose {g , f } $ .}
\ end { mcb }
\ end { sproblem }
‘‘‘
The question is answered correctly , if the student selects exactly the \ mcc
options marked with [ T ].
Lohr et al. 11

We delimit the STEX code example using the three backticks ‘‘‘, as is standard in, e.g., Markdown. In particular,
it is the same delimitation used by GPT (in particular ChatGPT) when producing code in its output.
Furthermore, we deliberately chose an example of a multiple choice question with more than one correct answer
– like (in our experience) human beings, GPT too seems to prefer having a single correct choice; in both cases likely
because it is easier to come up with questions where there is one correct answer. To direct the model to generate
more question types than just “single choice questions disguised as multiple choice”, we promote a variety of question
formats.
We subsequently provide similar examples for single choice and ﬁll-in-the-blank questions. Finally, we reiterate
the expected outcomes and add speciﬁc criteria we want the questions to satisfy, starting with ones that seemed
important to us from the start:

Once you are given the data described above , you are to reply with the given
number of quiz questions as LaTeX code with semantic markup .

Make sure that the questions you generate satisfy the following criteria :
- The answer to a question should give a good indication regarding the extent
to which the student has mastered the concept with respect to the given
cognitive dimension .
- The questions should be as diverse as possible , i . e . they should test for
different aspects of the concept .
- Make sure you provide good feedback , especially for wrong answers , so that
students can learn what they did wrong .
- Assume the students know nothing about the concept other than what ’ s in the
learning objects provided .
- In single / multiple choice questions , make sure that the distractors are
superficially plausible to students who have not yet mastered the subject ,
as to not make the problems too easy .
- Feel free to use examples the students might know from elsewhere , e . g . basic
high school level knowledge , or foundational principles of the study program
the course is in .

Finally, we list the parameters provided by the user as replacement variables that are substituted by the system:

%%%%
concepts : %% CONCEPT %%
course : %% COURSE %%
...
learning objects : %% L E A R N I N G _ O B J E C T S %%

Notable updates during iteration

While some aspects of the prompt discussed so far have changed during iterating, the “basic outline” of the prompt
seemed to work surprisingly well and has thus remained largely stable. In response to mistakes we noticed, we added
additional criteria to the list at the end, which we discuss now.
12 Lohr et al.

Occasionally, there was an excessive tendency in the output to refer to the course materials directly, resulting in
generated questions about e.g., speciﬁc names of variables in some example somewhere in the course notes. Hence,
we attempt to avoid questions in the output about irrelevant speciﬁcs of the course materials:

- We do not want students to rote - memorize definitions , examples , or other text

in the learning objects , so never make the correct answer dependent on such
details ( e . g . variable names , particular examples , etc .).

It also sometimes added feedback or explanations to the text of the answer choice, inevitably revealing the correct
and wrong answers in the questions itself:

- Do not put any text in the LaTeX code that directly states which answer is
correct - the sTeX macros used above take optional arguments explicitly for
that purpose .

In some instances, quiz questions were generated to ask for additional free-text answers, so we emphasize that
students will not be able to do so:

- Note that students are limited to replying to a question in the form the
question type is posed in , i . e . ticking boxes in single / multiple choice
questions , or filling in a short text in fill - in - the - blanks questions .
They can not provide any additional text .

The output rarely contained ﬁll-in-the-blank questions. When we modify the prompt to explicitly encourage
those, problems quickly emerge, which we attempted to ﬁx thusly:

- The correct answer must be unambiguous , particularly for

fill - in - the - blanks questions .
- Importantly , the evaluation of student ’ s answers in \ fillinsol is done
automatically via string matching , so \ fillinsol can only contain plain
text , no LaTeX code , and students need to type in the answer exactly to
get any points .

This seemed to help somewhat, but the primary noticeable result was even fewer fill-in-the-blank questions.
This should not be entirely surprising: beyond the remember dimension, fill-in-the-blank questions (where answers
are literally string matched to a reference solution) are primarily helpful to ask for an answer that (ideally) can only
result from, e.g., correctly applying an algorithm, computing some expression, counting specific properties, etc. – i.e.,
they are very useful to make questions asking for complex reasoning and processes autogradable. These questions tend
to be particularly difficult to design, though. Hence, the model strongly prefers multiple and single-choice questions.
In general, the generated questions for the apply dimension were not satisfactory. One strategy we tried to
improve this was to explicitly point out the use of fill-in-the-blank questions in the prompt, but this did not lead to
any noticeable improvement.
Lohr et al. 13

4.5 | Evaluation

In the literature, there are two established techniques to evaluate the quality of generated questions [24]: (1) auto-
matic evaluation and (2) human-based evaluation.

Certain criteria outlined in Section 3 necessitate speciﬁc insights from the course context, accessible only to
individuals within the lecture environment. Similarly, assessing the quality of STEX annotations requires expertise and
cannot be eﬀectively automated. Hence, an expert-based evaluation was selected – the most common evaluation
approach in the context of automatic question generation [24].

For this purpose, we (1) developed a structured survey to evaluate the quality of generated questions based on
the criteria above, (2) generated a collection of 30 questions, (3) conducted the survey with experts within the course
environment, and (4) conducted a qualitative analysis of the survey ﬁndings.

4.5.1 | Survey Design

The survey’s introductory segment supplied experts with context by outlining the parameters used in the question
generation prompt, followed by the complete presentation of the generated question (GQ), incorporating any semantic
annotations.

A free-text response field was incorporated to evaluate the accuracy of the content and detect any potential
errors in the GQ. This section allowed experts to pinpoint specific content discrepancies or inaccuracies within the
GQ. Additionally, we employed a 5-point Likert scale, spanning from “Very Difficult” to “Very Easy,” to assess the
perceived level of difficulty of the GQ.

The survey consisted of six statements to assess the generated question’s diﬀerent quality dimensions. Experts
were tasked with indicating their degree of agreement with these statements on a 7-point Likert scale, ranging from
“Strongly Disagree” to “Strongly Agree”. The statements were outlined as follows:

• The GQ has a good FIT in terms of teaching material.

• The GQ can be solved using the available teaching material.
• The task description of the GQ cannot be misinterpreted (is not ambiguous).
• The GQ is relevant for the achievement of the speciﬁed Learning Objective.
• The feedback provided for the answer options of the GQ is helpful.
• The structure of the task corresponds to the speciﬁed task format.

Upon completing the survey, the experts could expand their assessments by noting any additional irregularities
or remarks. This section, intentionally left open-ended, was aimed at allowing the experts to share extra observations,
recommendations, or issues they might have about the tasks presented.

4.5.2 | Parameter Selection

To evaluate GQs systematically across multiple topics, we decided on generating a total of 30 questions across six
distinct subjects within a university course “Artiﬁcial Intelligence I”, focusing on topics in symbolic AI with otherwise
ﬁxed parameters in the prompt (see Section 4.3).
14 Lohr et al.

Difficulty and Cognitive Dimension The majority of task generation systems documented in the literature focus
solely on generating factual questions that target the remember dimension [39]. However, since our main objective
in AGQ is to create learning materials capable of effectively enhancing our learner models, it is imperative to include
tasks that assess at least the cognitive level of understand. Thus, we have opted to confine our investigation in this
experiment to tasks that align with the understand dimension, assuming that the remaining dimensions are even
more challenging to design questions for in our context (as for humans, so for LLMs). We fixed the difficulty level
to be “medium” since we had already determined the model to largely generate relatively easy questions, which are
consequently less informative when assessing students’ mastery of a topic.

Number of Questions There is a trade-off regarding the number of questions generated per prompt at once: The
more questions generated, the more diverse they can be expected to be. In contrast, generating fewer per prompt
and prompting multiple times prevents immediate consideration of previously generated questions, making it difficult
to minimize overlap between questions. However, the model often refuses to generate too many questions at once;
it only replies with a few and instructs the user to prompt to yield more. The maximal number of GQs per prompt
seems to depend on the topic or, presumably, the amount of course materials added to the prompt, but up to five
questions at once seems to work consistently. We leave a more systematic investigation of this trade-off and the
resulting quality of sets of questions as a whole to future work and fixed the number to be five GQs per topic (for a
total of 30 GQs).

Topics We chose six topics from the course spread roughly uniformly across the table of contents of the course
notes, with an attempt of making some “broader” and others more “speciﬁc”; these being:

1. Arc Consistency (constraint satisfaction problems),

2. Alpha-Beta Search (game play)
3. Semantics of Propositional Logic
4. Syntax of First-Order Logic
5. The STRIPS model (planning)
6. The delete Relaxation (a heuristic in solving planning problems).

5 | RESULTS AND DISCUSSION

5.1 | Question Quality

The question types of the GQs were dominated by single choice questions (12 out of 30), followed by multiple choice
questions (18 out of 30). The first consistently had three possible answers, while the latter almost always offered four
options, usually with three distractors. The constant number of answers may be because the examples in the prompt
also contain precisely this number of answers. With the MCQ, explicit care was taken to ensure that the example
provided contained more than one correct answer. Nevertheless, the generated MCQs almost exclusively contained
exactly three distractors. Notably, no fill-in-the-blank (FIB) questions appear in the generated questions. This is likely
because designing meaningful FIB questions is significantly more challenging for the understand dimension, especially
considering the necessary constraint that the answer be unambiguous.
The FIT of the GQs to the provided teaching material was evaluated by the experts as consistently given (28 out
of 30), and most of them were assessed as solvable using the available teaching material (27 out of 30). In most cases,
Lohr et al. 15

the GQs were evaluated as clearly and unambiguously formulated. Finally, most experts rated the GQs as relevant
for achieving the learning objective.
However, the quality of the provided feedback was mixed. Frequently, no feedback was generated for an answer
option. When it was, it was often not helpful in that it merely rephrased a wrong answer again – i.e., the feedback to
an incorrect choice X was a variant of “No, it is not the case that X ”. This instance of uninformative feedback has also
been observed and reported in other studies on LLM-feedback (e.g., [40]).

5.2 | Content Errors

Eleven out of 30 GQs contained errors, mainly in the answer options and the feedback. These errors occurred particu-
larly frequently in the topics arc consistency and semantics of propositional logic. We can only speculate on the reasons
for this. Still, it seems likely that it correlates with the conceptual complexity of the topics or how well the topic is
represented in the model’s original training data – semantics requires a much better understanding of the meaning
of logical statements and the formal/mathematical mechanisms that provide statements with meaning (interpretation
functions, models and their formal deﬁnitions, etc), and there are multiple ways to formulate these mathematically. Sim-
ilarly, to understand arc consistency requires a solid intuition about how various constraints in a constraint satisfaction
problem interact with respect to the variables in a particular problem.
One crucial result to consider is that the model may produce superﬁcially plausible questions that align with the
“aesthetic” of a good question but are false in not necessarily trivial ways.
One particularly striking example is the following question on the topic of propositional logic (generated in an
earlier experiment),

If we have already established ¬B, how can we use the Natural Deduction rule for implication (⇒) elimination to infer
a new formula?
□
X Given B ⇒ C , deduce ¬C

where the answer above was considered the intended and only correct one. This is the common fallacy of denying the
antecedent, but without understanding the semantics of propositional logic (or the natural deduction calculus) already,
this is not immediately obvious.
Notable here is that the question is not only wrong, but that it actively reinforces a common misconception that
may not necessarily be caught even by, e.g., student employees who assume they are suﬃciently knowledgeable in
the domain to evaluate questions, which is arguably even more dangerous than a merely wrong question.

5.3 | Semantic Annotations

We can distinguish between two kinds of annotations in STEX: Structural annotations and relational annotations.
Structural annotations, such as \begin{sproblem}, \begin{mcb} (for MCQ-blocks),
\mcc[<T|F>,feedback={...}]{<answer>} (for an answer in a MCQ-block), \objective{<competency>}{<symbol>}
etc. should be present in all questions, or all questions of a particular question type, and are very schematic in the
sense that they are not particular to a speciﬁc question, topic or concept (other than the one explicitly provided as a
parameter to the prompt). These annotations seemingly pose no signiﬁcant challenge to the model – the few examples
in the prompt are enough for the model to pick up on them immediately and use them consistently and correctly in
(almost) all generated questions. Rarely does it hallucinate plausible but non-existent symbol names in the objectives.
16 Lohr et al.

This is not the case for relational annotations, such as symbol or module references. To annotate a piece of text
with a specific symbol, one has to know the module it is declared in, import it (using \usemodule{...}), and then
annotate the text using the name of the symbol (e.g. “$a$ \sr{addition}{plus} $b$” annotates the word plus
with the symbol named “addition” which is currently in scope – globally, there may be multiple symbols with the same
name). Naturally, how to do such relational annotations can not be directly inferred from a few generic examples.
Instead, humans and AI models need a way to systematically look up symbols and their containing modules if we want
to be able to refer to them in annotations.
As mentioned in Section 4.1, we use these annotations to determine the prerequisites of learning objects, in-
cluding questions, so ideally, we would want the model to annotate as much as possible. Our initial hope that the
provided learning objects – which should have ample occurrences of the relevant symbol names and module imports
– would be enough for the model to “learn” to use them correctly turned out to be clearly false.4 Additional explicit
instructions in the prompt also did not improve matters.
We therefore attempted to use the canonical way of giving models the ability to look up additional information:
function calling and retrieval-augmented generation. OpenAI’s API for GPT-4 offers explicit support for providing the
model with a set of functions and determining which parameters they need. When prompted, the model will then
either return an output as usual, or a JSON object specifying (1) which function the model “wants” to call, and (2) the
parameters the call should use. It is then up to the caller of the API to implement functionality that augments the
following prompt with the correct return value for that function call.
We allowed the model to call a search function with a sequence of arbitrary strings as parameters. We augmented
the prompt by adding a paragraph that instructed the model to use this function to determine how to semantically
annotate any text that refers to a concept from the domain of the course. We then used existing functionality to
search for any definitions in our corpora and picked the top ten results for each of the string parameters. The search
results were subsequently appended to the prompt in a new delimited section, the instruction to call the function was
removed (to avoid potentially many subsequent calls or infinite loops), and the resulting prompt was fed back to the
model.
This, too, unfortunately, did not have the desired effect; on the contrary. The quality of the questions generated
got noticeably worse without improving the quality of the semantic annotations. We conjecture that by appending
the search result and making the total prompt significantly longer, individual parts of the prompt have less influence
on the output, e.g. the instructions at the beginning of the prompt. It is possible that more experimentation, possibly
including repeating the instructions multiple times throughout the prompt, can lead to better results in the long run.5

5.4 | Limitations
We used a single model (GPT-4-turbo) to generate the questions for the evaluation. At the time of the study (March
2024), this model was – in our opinion – the most suitable model (see Section 4.2 for a detailed explanation). However,
we explicitly do not rule out the possibility that there will be ongoing “better” models that are more suitable and can
lead to better results. A further limitation of this study is that we did not involve students in evaluating the GQ. To
obtain objective evidence of the quality of the tasks, a survey of students on the tasks would be helpful and is planned
for the future. Especially when evaluating multiple choice questions, for example, an assessment by (only) experts is
not suﬃcient. Whether a distractor is good or bad usually only becomes apparent when the exercise is presented to

4 One reason for that could be the unfortunately inconsistent usage of annotations in the course materials in general, improvements of which
happen frequently but are far from exhaustive.
5 According to anecdotal “LLM-folklore”, repeating important instructions in a long prompt is supposed to improve results.
Lohr et al. 17

students (who try to solve it). If learners never choose the distractor, it is a good indicator that it is too apparent and,
therefore, unsuitable. At the time of evaluation, however, there was no possibility of recruiting suitable students from
the course. Additionally, most generated questions were not “ready to use” on students without passing a “human
filter”. This means that we would no longer have evaluated the LLM’s capabilities, but the experts’ ability to filter the
generated questions sensibly.
Furthermore, the study did not include any control items. As we did not aim to find out whether the experts were
able to distinguish between LLM-generated and human-generated tasks the control items were of less importance.
Their only function would have been to normalize the expert’s rating, which was not necessary for our research
questions.

6 | CONCLUSION

In this experiment, we investigated the extent to which large language models are suitable for generating semantically
annotated quiz questions for higher education from semantically annotated course materials that are (1) suitable for
a specific course and (2) address understanding a concept.
Our research reveals significant limitations in the application of LLMs for this task. While generating questions that
address remembering factual knowledge works well, creating questions that address understanding remains a notable
challenge – which would need questions aiming for conceptual knowledge, or asking for explanations rather than recall
of a simple concept. Despite LLMs’ ability to generate a range of questions, domain experts’ need for extensive filtering
underscores the models’ inadequacies in autonomously generating educationally valuable content. The quality of
questions generated by LLMs often does not meet educational standards, primarily due to the simplistic nature of
the questions and the lack of valuable (and correct) feedback. Thus, the required careful review process by experts is
both time-consuming and counter to the goal of automated content creation. In addition, errors in the content of the
questions are challenging to detect for non-experts, as semantically incorrect output from LLMs is known to be very
close to the “truth” [41].
Our results show that generating structural, semantic annotations works well. However, this success was not
reflected in the case of relational annotations, which exhibited poor integration, indicating a limitation in the LLM’s
“ability” to contextualize and link concepts within the generated content effectively.
The results of the present work make clear that while LLMs can contribute to the pool of learning materials,
their current state requires significant human intervention to refine and validate the generated content. So, despite
the promises systems using LLMs give, the human-in-the-loop remains crucial. Although the prompt was carefully
designed and we provided a lot of static and dynamic context, the results still show weaknesses in quality and semantic
annotation. Nevertheless, further studies with a high number of generated tasks and more experts might show that
the automatic generation of questions following our approach could lead to more efficiency when used as an assistive
system for the preparation process.

Acknowledgements
The work reported in this article was conducted as part of the VoLL-KI (see https://voll-ki.de) funded by the
German Research/Education Ministry under grant 16DHBKI089. We also would like to thank Rakesh Kumar for the
work in developing a signiﬁcant portion of the front end for the question generator.
18 Lohr et al.

references
[1] Hattie JAC. Visible learning: a synthesis of over 800 meta-analyses relating to achievement. New York: Routledge;
2009.

[2] Kirby JR, Lawson MJ, editors. Enhancing the Quality of Learning: Dispositions, Instruction, and Learning Processes.
Cambridge: Cambridge University Press; 2012.

[3] Sweller J. Cognitive Load During Problem Solving: Eﬀects on Learning. Cognitive Science 1988;12(2):257–285.

[4] Sweller J, van Merrienboer JG, Paas FWC. Cognitive Architecture and Instructional Design. Educational Psychology
Review 1998;10(3):251–296.

[5] Bloom BS. Learning of mastery. Evaluation Comment 1968;1(2):1–12.

[6] Keller FS. "Good-bye, teacher...". Journal of applied behavior analysis 1968;1(1):79–89.

[7] Kruse T, Berges M, Betzendahl J, Kohlhase M, Lohr D, Müller D. Learning with ALeA: Tailored Experiences through
Annotated Course Material. [object Object]; 2023.

[8] Wolfe JH. Automatic Question Generation from Text - an Aid to Independent Study. In: Proceedings of the ACM
SIGCSE-SIGCUE Technical Symposium on Computer Science and Education - Not Known: ACM Press; 1976. p. 104–
112.

[9] Denny P, Khosravi H, Hellas A, Leinonen J, Sarsa S. Can We Trust AI-Generated Educational Content? Comparative
Analysis of Human and AI-Generated Learning Resources. eprint 2023;.

[10] Song T, Tian Q, Xiao Y, Liu S. Automatic Generation of Multiple-Choice Questions for CS0 and CS1 Curricula Using
Large Language Models. In: Hong W, Kanaparan G, editors. Computer Science and Education. Computer Science and
Technology, vol. 2023 Singapore: Springer Nature Singapore; 2024.p. 314–324.

[11] Sarsa S, Denny P, Hellas A, Leinonen J. Automatic Generation of Programming Exercises and Code Explanations Using
Large Language Models. In: Proceedings of the 2022 ACM Conference on International Computing Education Research
- Volume 1 Lugano and Virtual Event Switzerland: ACM; 2022. p. 27–43.

[12] Kurdi G, Leo J, Parsia B, Sattler U, Al-Emari S. A Systematic Review of Automatic Question Generation for Educational
Purposes. International Journal of Artiﬁcial Intelligence in Education 2020 Mar;30(1):121–204.

[13] Prather J, Denny P, Leinonen J, Becker BA, Albluwi I, Craig M, et al. The Robots Are Here: Navigating the Generative AI
Revolution in Computing Education. In: Proceedings of the 2023 Working Group Reports on Innovation and Technology
in Computer Science Education Turku Finland: ACM; 2023. p. 108–159.

[14] Bhat S, Nguyen H, Moore S, Stamper J, Sakr M, Nyberg E. Towards Automated Generation and Evaluation of Questions
in Educational Domains. In: Proceedings of the 15th International Conference on Educational Data Mining International
Educational Data Mining Society; 2022. p. 701–704.

[15] Wang Z, Valdez J, Basu Mallick D, Baraniuk RG. Towards Human-Like Educational Question Generation with Large
Language Models. In: Rodrigo MM, Matsuda N, Cristea AI, Dimitrova V, editors. Artiﬁcial Intelligence in Education, vol.
13355 Cham: Springer International Publishing; 2022.p. 153–166.

[16] Wong LH, Park H, Looi CK. From Hype to Insight: Exploring ChatGPT ’s Early Footprint in Education via Altmetrics and
Bibliometrics. Journal of Computer Assisted Learning 2024 Feb;p. jcal.12962.

[17] Yan L, Sha L, Zhao L, Li Y, Martinez-Maldonado R, Chen G, et al. Practical and Ethical Challenges of Large Language
Models in Education: A Systematic Scoping Review. British Journal of Educational Technology 2024 Jan;55(1):90–112.
Lohr et al. 19

[18] McNichols H, Feng W, Lee J, Scarlatos A, Smith D, Woodhead S, et al. Exploring Automated Distractor and Feedback
Generation for Math Multiple-choice Questions via In-context Learning. eprint 2023;.

[19] Dijkstra R, Genç Z, Kayal S, Kamps J. Reading Comprehension Quiz Generation Using Generative Pre-trained Transform-
ers. In: Proceedings of the Fourth International Workshop on Intelligent Textbooks 2022, vol. 4 Durham, UK: Aachen:
CEUR-WS; 2022. p. 4–17.

[20] Tran A, Angelikas K, Rama E, Okechukwu C, Smith DH, MacNeil S. Generating Multiple Choice Questions for Computing
Courses Using Large Language Models. In: 2023 IEEE Frontiers in Education Conference (FIE) College Station, TX, USA:
IEEE; 2023. p. 1–8.

[21] Anderson LW, Krathwohl DR, editors. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxon-
omy of Educational Objectives. Complete ed ed. New York: Longman; 2001.

[22] Lohr D, Berges M, Kohlhase M, Müller D, Rapp M. The Y-Model - Formalization of Computer Science Tasks in the
Context of Adaptive Learning Systems. In: 2023 IEEE 2nd German Education Conference (GECon) Berlin, Germany:
IEEE; 2023. p. 1–6.

[23] Ch DR, Saha SK. Automatic Multiple Choice Question Generation From Text: A Survey. IEEE Transactions on Learning
Technologies 2020 Jan;13(1):14–25.

[24] Mulla N, Gharpure P. Automatic Question Generation: A Review of Methodologies, Datasets, Evaluation Metrics, and
Applications. Progress in Artiﬁcial Intelligence 2023 Mar;12(1):1–32.

[25] Hattie J, Timperley H. The Power of Feedback. Review of Educational Research 2007 Mar;77(1):81–112.

[26] Kiesler N, Lohr D, Keuning H. Exploring the Potential of Large Language Models to Generate Formative Programming
Feedback. In: 2023 IEEE Frontiers in Education Conference (FIE) College Station, TX, USA: IEEE; 2023. p. 1–5.

[27] Lohr D, Kiesler N, Keuning H, Jeuring J. “Let Them Try to Figure It Out First” - Reasons Why Experts (Do Not) Pro-
vide Feedback to Novice Programmers. In: Proceedings of the 2024 Innovation and Technology in Computer Science
Education (ITiCSE 2024), vol. 1 Milan, Italy: ACM; 2024. p. 7.

[28] Hellas A, Leinonen J, Sarsa S, Koutcheme C, Kujanpää L, Sorva J. Exploring the Responses of Large Language Models
to Beginner Programmers’ Help Requests. In: Proceedings of the 2023 ACM Conference on International Computing
Education Research V.1 Chicago IL USA: ACM; 2023. p. 93–105.

[29] Balse R, Valaboju B, Singhal S, Warriem JM, Prasad P. Investigating the Potential of GPT-3 in Providing Feedback for
Programming Assessments. In: Proceedings of the 2023 Conference on Innovation and Technology in Computer Science
Education V. 1 Turku Finland: ACM; 2023. p. 292–298.

[30] Matelsky JK, Parodi F, Liu T, Lange RD, Kording KP. A Large Language Model-Assisted Education Tool to Provide Feed-
back on Open-Ended Responses. eprint 2023;.

[31] Kohlhase M, Müller D. System Description: sTeX3 – A LATEX-based Ecosystem for Semantic/Active Mathematical Docu-
ments. In: Buzzard K, Kutsia T, editors. Intelligent Computer Mathematics (CICM) 2022, vol. 13467 of LNAI Springer;
2022. p. 184–188. https://kwarc.info/people/dmueller/pubs/cicm22stexsd.pdf.

[32] Kohlhase M. OMDoc – An open markup format for mathematical documents [Version 1.2]. No. 4180 in LNAI, Springer
Verlag; 2006. http://omdoc.org/pubs/omdoc1.2.pdf.

[33] Kohlhase M, Müller D, The sTeX3 Manual;. https://github.com/slatex/sTeX/blob/main/doc/stex-manual.pdf.

[34] Müller D. An HTML/CSS schema for TEX primitives – generating high-quality responsive HTML from generic TEX. In:
TUGboat; TUG 2023 Conference Proceedings, vol. 44; 2023. p. 275–286. https://kwarc.info/people/dmueller/pubs/
tug23.pdf.
20 Lohr et al.

[35] OpenAI, GPT-4 Technical Report. arXiv; 2023. https://arxiv.org/abs/2303.08774.

[36] Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al., LLaMA: Open and Eﬃcient Foundation Language
Models; 2023.

[37] Zamﬁrescu-Pereira JD, Wong RY, Hartmann B, Yang Q. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail)
to Design LLM Prompts. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems CHI
’23, New York, NY, USA: Association for Computing Machinery; 2023. https://doi.org/10.1145/3544548.3581388.

[38] OpenAI, OpenAI API Documentation: Prompt Engineering; 2024. https://platform.openai.com/docs/guides/prompt-

engineering.

[39] Kumar AP, Nayak A, K MS, Chaitanya, Ghosh K. A Novel Framework for the Generation of Multiple Choice Question
Stems Using Semantic and Machine-Learning Techniques. International Journal of Artiﬁcial Intelligence in Education
2023 Mar;.

[40] Lohr D, Keuning H, Kiesler N. You’re (Not) My Type – Can LLMs Generate Feedback of Speciﬁc Types for Introductory
Programming Tasks? Journal of Computer Assisted Learning (JCAL) 2024;Accepted.

[41] Sobieszek A, Price T. Playing Games with Ais: The Limits of GPT-3 and Similar Large Language Models. Minds and
Machines 2022 Jun;32(2):341–364.

The Future of Learning in The Age of Generative AI
No ratings yet
The Future of Learning in The Age of Generative AI
13 pages
Large Language Models For Education: A Survey: Xu, Gan, Qi, Wu and Yu
No ratings yet
Large Language Models For Education: A Survey: Xu, Gan, Qi, Wu and Yu
19 pages
Downloed Papers
No ratings yet
Downloed Papers
700 pages
Gamification
No ratings yet
Gamification
11 pages
Cain 2024 Prompting Change Exploring Prompt e
No ratings yet
Cain 2024 Prompting Change Exploring Prompt e
11 pages
Exploring The Potential of Using ChatGPT in Physics Education
No ratings yet
Exploring The Potential of Using ChatGPT in Physics Education
19 pages
Comparing Text Augmentation by GPT-3.5 and Llama3 For Evaluating Student Responses
No ratings yet
Comparing Text Augmentation by GPT-3.5 and Llama3 For Evaluating Student Responses
27 pages
CodeHelp - Using Large Language Models With Guardrails For Scalable Support in Programming Classes
No ratings yet
CodeHelp - Using Large Language Models With Guardrails For Scalable Support in Programming Classes
11 pages
Softcot: Soft Chain-Of-Thought For Efficient Reasoning With Llms
No ratings yet
Softcot: Soft Chain-Of-Thought For Efficient Reasoning With Llms
13 pages
Augmenting LLMs With Knowledge - A Survey On Hallucination Prevention
No ratings yet
Augmenting LLMs With Knowledge - A Survey On Hallucination Prevention
11 pages
Natural Learning
No ratings yet
Natural Learning
35 pages
Llms
No ratings yet
Llms
17 pages
Courseassist: Pedagogically Appropriate Ai Tutor For Computer Science Education
No ratings yet
Courseassist: Pedagogically Appropriate Ai Tutor For Computer Science Education
3 pages
A Survey of Current Machine Learning Approaches To Student Free-Text Evaluation For Intelligent Tutoring - 2022
No ratings yet
A Survey of Current Machine Learning Approaches To Student Free-Text Evaluation For Intelligent Tutoring - 2022
39 pages
L L M A R - : Arge Anguage Odels As Nalogical Eason ERS
No ratings yet
L L M A R - : Arge Anguage Odels As Nalogical Eason ERS
24 pages
New Trends in Applied Machine Intelligence
No ratings yet
New Trends in Applied Machine Intelligence
17 pages
Major Projectpp
No ratings yet
Major Projectpp
5 pages
AI Literacy and Its Implications For Prompt Engineering Strategies
No ratings yet
AI Literacy and Its Implications For Prompt Engineering Strategies
16 pages
24 - VNHSGE - VietNamese High School Graduation Examination Dataset For Large Language Models
No ratings yet
24 - VNHSGE - VietNamese High School Graduation Examination Dataset For Large Language Models
74 pages
pdf2306 08997 PDF
No ratings yet
pdf2306 08997 PDF
20 pages
Major Projectpp
No ratings yet
Major Projectpp
6 pages
Author Name Title Paper/Submission ID Submitted by Submission Date Total Pages Document Type
No ratings yet
Author Name Title Paper/Submission ID Submitted by Submission Date Total Pages Document Type
8 pages
Adapting Large Language Models For Education: Foundational Capabilities, Potentials, and Challenges
No ratings yet
Adapting Large Language Models For Education: Foundational Capabilities, Potentials, and Challenges
13 pages
Large Language Models For Software Engineering
No ratings yet
Large Language Models For Software Engineering
79 pages
DRAFT
No ratings yet
DRAFT
6 pages
Fire System
No ratings yet
Fire System
3 pages
My Library - CSV
No ratings yet
My Library - CSV
10 pages
The Impact of Structured Prompt-Driven Generative AI On Learning Data Analysis in Engineering Students
No ratings yet
The Impact of Structured Prompt-Driven Generative AI On Learning Data Analysis in Engineering Students
8 pages
Case Study For Procurement
No ratings yet
Case Study For Procurement
62 pages
Yang 2018 Icwsm
No ratings yet
Yang 2018 Icwsm
10 pages
Appendix 3 CIC BIM User Guide For Preparation of Statutory Plan Submissions Revit Dec2020
No ratings yet
Appendix 3 CIC BIM User Guide For Preparation of Statutory Plan Submissions Revit Dec2020
160 pages
Enhancing Educational Qa Systems: Integrating Knowledge Graphs and Large Language Models For Context-Aware Learning
No ratings yet
Enhancing Educational Qa Systems: Integrating Knowledge Graphs and Large Language Models For Context-Aware Learning
9 pages
Comparison of Large Language Models For Generating Contextually Relevant Questions
No ratings yet
Comparison of Large Language Models For Generating Contextually Relevant Questions
6 pages
E Nhancing e Ducational Qa S Ystems I Ntegrating K Nowledge G Raphs A ND L Arge L Anguage M Odels F or C Ontext A Ware L Earning
No ratings yet
E Nhancing e Ducational Qa S Ystems I Ntegrating K Nowledge G Raphs A ND L Arge L Anguage M Odels F or C Ontext A Ware L Earning
9 pages
AI-Tutoring in Software Engineering Education: Eduard Frankford Clemens Sauerwein Patrick Bassner
No ratings yet
AI-Tutoring in Software Engineering Education: Eduard Frankford Clemens Sauerwein Patrick Bassner
11 pages
Ingram 2023 Prompting Large Language Models Power Educational Chatbots
No ratings yet
Ingram 2023 Prompting Large Language Models Power Educational Chatbots
20 pages
Machine Learning @quizizz
No ratings yet
Machine Learning @quizizz
2 pages
14987-Article Text-18506-1-2-20201228
No ratings yet
14987-Article Text-18506-1-2-20201228
10 pages
Kalyan 1 s2.0 S2949719123000456 Main
No ratings yet
Kalyan 1 s2.0 S2949719123000456 Main
48 pages
Physics On Autopilot Exploring The Use of An AI As
No ratings yet
Physics On Autopilot Exploring The Use of An AI As
20 pages
Dami Reference
No ratings yet
Dami Reference
7 pages
Applsci 14 09125
No ratings yet
Applsci 14 09125
19 pages
2023 BDCC - Short Ans. Grading
No ratings yet
2023 BDCC - Short Ans. Grading
14 pages
Impact Robotic
No ratings yet
Impact Robotic
21 pages
20i0551 - Muhammad Usman - Project-Proposal
No ratings yet
20i0551 - Muhammad Usman - Project-Proposal
3 pages
FutureOfLearning LLMs Book Chapter
No ratings yet
FutureOfLearning LLMs Book Chapter
12 pages
E Nhancing e Ducational Qa S Ystems I Ntegrating K Nowledge G Raphs A ND L Arge L Anguage M Odels F or C Ontext A Ware L Earning
No ratings yet
E Nhancing e Ducational Qa S Ystems I Ntegrating K Nowledge G Raphs A ND L Arge L Anguage M Odels F or C Ontext A Ware L Earning
9 pages
22 - Can We Trust Ai-Generated Educational Content
No ratings yet
22 - Can We Trust Ai-Generated Educational Content
15 pages
Beyond Traditional Teaching
No ratings yet
Beyond Traditional Teaching
43 pages
2023 LLM4 Edu
No ratings yet
2023 LLM4 Edu
11 pages
z-s2.0-S2666920X23000516-main (Bernabei, 2023)
No ratings yet
z-s2.0-S2666920X23000516-main (Bernabei, 2023)
18 pages
10.48550 Arxiv.2204.02311
No ratings yet
10.48550 Arxiv.2204.02311
87 pages
Adapting Large Language Models For Education: Foundational Capabilities, Potentials, and Challenges
No ratings yet
Adapting Large Language Models For Education: Foundational Capabilities, Potentials, and Challenges
31 pages
Perspective Large Languagemodels in Applied Mechanics
No ratings yet
Perspective Large Languagemodels in Applied Mechanics
7 pages
1 s2.0 S2666920X24000262 Main
No ratings yet
1 s2.0 S2666920X24000262 Main
14 pages
On The Application of Large Language Models For Language Teaching and Assessment Technology
No ratings yet
On The Application of Large Language Models For Language Teaching and Assessment Technology
25 pages
Escholarship UC Item 6kf0r28s
No ratings yet
Escholarship UC Item 6kf0r28s
45 pages
From GPT To BERT:: Benchmarking Large Language Models For Automated Iz Generation
No ratings yet
From GPT To BERT:: Benchmarking Large Language Models For Automated Iz Generation
2 pages
Practical and Ethical Challenges of Large Language Models in Education: A Systematic Scoping Review
No ratings yet
Practical and Ethical Challenges of Large Language Models in Education: A Systematic Scoping Review
23 pages
1 s2.0 S1041608023000195 Main
No ratings yet
1 s2.0 S1041608023000195 Main
9 pages
BAS Workbook
No ratings yet
BAS Workbook
103 pages
Harrison Githinji - Data Annotator - Resume
No ratings yet
Harrison Githinji - Data Annotator - Resume
2 pages
Sample Annotated Bibliography Thesis
100% (3)
Sample Annotated Bibliography Thesis
4 pages
PrePoMax-v2 0 0-Manual
No ratings yet
PrePoMax-v2 0 0-Manual
52 pages
Pizarro 等 - 2022 - Automatic floor plan analysis and recognition
No ratings yet
Pizarro 等 - 2022 - Automatic floor plan analysis and recognition
21 pages
How To Write Bibliography in Thesis Writing
100% (1)
How To Write Bibliography in Thesis Writing
4 pages
CDS Annotations
No ratings yet
CDS Annotations
118 pages
The Penn Discourse Treebank 20 Annotation Manual
No ratings yet
The Penn Discourse Treebank 20 Annotation Manual
107 pages
Handout 10904 Ellis APractical Guideto GISin Civil 3 D
No ratings yet
Handout 10904 Ellis APractical Guideto GISin Civil 3 D
37 pages
Full
No ratings yet
Full
350 pages
Graphically Speaking Unmasking Abuse in Social Med
No ratings yet
Graphically Speaking Unmasking Abuse in Social Med
14 pages
Obj 12 Mov 2-Teaching Materials With Annotation
No ratings yet
Obj 12 Mov 2-Teaching Materials With Annotation
2 pages
A2 English Language Coursework Bibliography
100% (2)
A2 English Language Coursework Bibliography
8 pages
Productivity Tools For Anz
No ratings yet
Productivity Tools For Anz
125 pages
Fujifilmdrypix 4000 Dicom
No ratings yet
Fujifilmdrypix 4000 Dicom
64 pages
Embracing Imperfect Datasets: A Review of Deep Learning Solutions For Medical Image Segmentation
No ratings yet
Embracing Imperfect Datasets: A Review of Deep Learning Solutions For Medical Image Segmentation
34 pages
SP3D - Drawings and Reports - S5 - Tutorial
No ratings yet
SP3D - Drawings and Reports - S5 - Tutorial
13 pages
SEQUOIA VA50B InternalSummary HOOD05162003365811
No ratings yet
SEQUOIA VA50B InternalSummary HOOD05162003365811
7 pages
Bibliography Example Thesis
100% (3)
Bibliography Example Thesis
5 pages
Market Review2025
No ratings yet
Market Review2025
7 pages
70 Ways Sds Enhances Autocad Electrical For Substation PC Design
No ratings yet
70 Ways Sds Enhances Autocad Electrical For Substation PC Design
57 pages
Tagging
No ratings yet
Tagging
17 pages
SWENER-1800 A Corpus For Named Entity Recognition
No ratings yet
SWENER-1800 A Corpus For Named Entity Recognition
14 pages
Issues BB Ultra
No ratings yet
Issues BB Ultra
11 pages
Samll Text Active Learning
No ratings yet
Samll Text Active Learning
12 pages
Figurative Language Euphoria LYRICS Lesson Plan
No ratings yet
Figurative Language Euphoria LYRICS Lesson Plan
3 pages
ShopNet Product Image Annotation Guidelines 11.16.21
No ratings yet
ShopNet Product Image Annotation Guidelines 11.16.21
12 pages
Rhetoric in Everyday Life Assignment Sheet
No ratings yet
Rhetoric in Everyday Life Assignment Sheet
5 pages
Actual Fucking Annoying Site
No ratings yet
Actual Fucking Annoying Site
3 pages
Transforming Education with AI: Guide to Understanding and Using ChatGPT in the Classroom
From Everand
Transforming Education with AI: Guide to Understanding and Using ChatGPT in the Classroom
Shane Snipes, PhD
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Leveraging Large Language Models To Generate Course-Specific Semantically Annotated Learning Objects

Uploaded by

Leveraging Large Language Models To Generate Course-Specific Semantically Annotated Learning Objects

Uploaded by

ORIGINAL ARTICLE

Leveraging Large Language Models to Generate

Course-speciﬁc Semantically Annotated Learning

Dominic Lohr1† | Marc Berges1† | Abhishek Chugh3 |

1 Professorship for Computer Science

tervention to reﬁne and validate the generated content.

3 | REQUIREMENTS FOR THE GENERATED QUESTIONS

1. The concepts the question is intended to test,

FIGURE 1 Source Code and Representation in the system

• correct technical language,

4.1 | Semantic Annotations in STEX

4.2 | Model Selection

4.3 | Overview of the Generation Pipeline

4.4 | Prompt Design

1. Be speciﬁc in what the output should look like,

Your task is to generate quiz questions that evaluate the competency of

A quiz question can be either a multiple choice question , a single choice

Assume $ \ fun {f , g }\ Natura lNumbe rs \ Na t ur al N um b er s$ . Which of the following are

Notable updates during iteration

- We do not want students to rote - memorize definitions , examples , or other text

- The correct answer must be unambiguous , particularly for

4.5.1 | Survey Design

• The GQ has a good FIT in terms of teaching material.

4.5.2 | Parameter Selection

1. Arc Consistency (constraint satisfaction problems),

5 | RESULTS AND DISCUSSION

5.1 | Question Quality

5.2 | Content Errors

5.3 | Semantic Annotations

[5] Bloom BS. Learning of mastery. Evaluation Comment 1968;1(2):1–12.

[33] Kohlhase M, Müller D, The sTeX3 Manual;. https://github.com/slatex/sTeX/blob/main/doc/stex-manual.pdf.

[35] OpenAI, GPT-4 Technical Report. arXiv; 2023. https://arxiv.org/abs/2303.08774.

[38] OpenAI, OpenAI API Documentation: Prompt Engineering; 2024. https://platform.openai.com/docs/guides/prompt-

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.