Abstract
Generative AI and large language models hold great promise in enhancing programming
education by automatically generating individualized feedback for students. We investigate
the role of generative AI models in providing human tutor-style programming hints to help
students resolve errors in their buggy programs. Recent works have benchmarked state-of-
the-art models for various feedback generation scenarios; however, their overall quality is still
inferior to human tutors and not yet ready for real-world deployment. In this paper, we seek
to push the limits of generative AI models toward providing high-quality programming hints
and develop a novel technique, GPT4Hints-GPT3.5Val. As a first step, our technique
leverages GPT-4 as a “tutor” model to generate hints – it boosts the generative quality by
using symbolic information of failing test cases and fixes in prompts. As a next step, our
technique leverages GPT-3.5, a weaker model, as a “student” model to further validate the
hint quality – it performs an automatic quality validation by simulating the potential utility
of providing this feedback. We show the efficacy of our technique via extensive evaluation
using three real-world datasets of Python programs covering a variety of concepts ranging
from basic algorithms to regular expressions and data analysis using the pandas library.
1 Introduction
Generative AI and large language models (LLMs) have the potential to drastically improve the landscape
of computing and programming education by powering next-generation educational technologies. This
potential lies in the advanced capabilities of state-of-the-art models—like OpenAI's GPT-4 [1] and ChatGPT (based on GPT-3.5) [2]—to automatically generate high-quality personalized content and feedback
for students [3–5]. A series of recent works have already shown us sparks of their capabilities for various
programming education scenarios, including generating new programming assignments [6, 7], providing
code explanations [6, 8], repairing buggy programs [9, 10], enhancing programming error messages [10, 11],
and acting as a pair programmer [12, 13]. In this paper, we investigate the role of LLMs in providing human
tutor-style programming hints to help students resolve errors in their buggy programs. More concretely,
given a programming task and a student’s buggy program, we want to generate natural language hints to
help the student resolve bug(s) and make progress, inspired by how a human tutor would give pedagogical
feedback. With the current scale of enrollments in introductory programming courses [14], it has become
infeasible for human tutors to promptly provide individualized feedback to students, thereby motivating
the need to develop automatic feedback generation techniques. To this end, we aim to leverage generative
AI and LLMs for automating human tutor-style programming feedback to support students’ learning and
reduce human tutors’ workload. Recent works have studied state-of-the-art LLMs for generating various
forms of programming feedback for students, including detailed explanations about bugs or single-sentence
hints [4, 10, 11]. Despite promising initial results, the overall quality of feedback generated by LLMs is
substantially inferior to that of human tutors and not yet ready for deployment in real-life classroom settings.
For instance, a recent benchmark study [4] evaluated GPT-4 on generating hints for buggy programs in
introductory Python programming tasks and assessed its quality using expert annotations – GPT-4's
performance in terms of hint quality was only about 60%, in contrast to human tutors' performance
of over 90%. This performance gap between GPT-4 and human tutors can be attributed to several factors, as
discussed next. First, state-of-the-art models still struggle with symbolic reasoning and program execution
abilities crucial for understanding the underlying bugs and possible student misconceptions [3–5, 15].
Second, these models also suffer from hallucination issues and the generated feedback text—even though
seemingly plausible—may contain inaccurate information that could have detrimental effects on students’
learning [15–17]. Third, these models still lack a calibration mechanism to decide whether the generated
content is of high quality or not [10]; in particular, they are unable to perform human tutor-style reasoning
from a student's perspective and judge whether the generated feedback would likely help the student.
In this paper, we seek to push the limits of generative AI and state-of-the-art LLMs toward providing
high-quality programming hints. Given a base model, this would require improving the model’s abilities
at input-level by developing better prompting strategies [18], at output-level by developing mechanisms
to validate the generated content [10, 19, 20], or at model-level itself by fine-tuning (when considering
open-source models [21]). In our work, we consider OpenAI’s GPT-4 [1] as the base model—the latest model
presumably with over a trillion parameters—as it has shown to drastically improve existing models across
various programming education scenarios [4]. We develop a novel technique, GPT4Hints-GPT3.5Val, to
provide human tutor-style high-quality programming hints. Our technique leverages the GPT-4 model in
the role of a “tutor” to generate hints and boosts the generative quality at the input level by prompting it
with symbolic information of failing test cases and fixed programs. At the output level, it further validates
the hint quality by leveraging the GPT-3.5 model as a “student” to simulate the potential utility of
providing this feedback to human students. This validation step is designed to provide a quality assurance
layer and decides whether the generated feedback should be provided to the human student or not –
thereby trading off coverage (how many students are given automatic feedback) and precision (quality
of the given feedback). We show the efficacy of our technique by conducting an extensive evaluation
using three real-world datasets of Python programs covering a variety of concepts ranging from writing
basic algorithms to regular expressions and data analysis using pandas [22]. Figures 1 and 2 showcase
GPT4Hints-GPT3.5Val on two different buggy programs.1 More broadly, our work makes the following
contributions in leveraging generative AI and LLMs for computing and programming education:
1 When presenting these illustrative examples in this paper, we slightly obfuscate the students' buggy programs to avoid
showing exact real-world programs. We do so by altering variable names and formatting conventions while keeping the
original bugs intact, as done in related works [4, 10]. Accordingly, where needed, we apply the same adjustments
to the generated output to maintain consistency with these alterations.
Given a string S, check if it is palindrome or not. Expected Time Complexity: O(|S|).
Constraints: 1 ≤ |S| ≤ 2 * 10^5.
(a) Description of the programming task
(b) Student's buggy program:

# User function Template for python3
class Solution:
    def reverse(self, b, e, S):
        while b < e:
            S[b], S[e] = S[e], S[b]
            e = e - 1
            b = b + 1
        return S

    def isPalindrome(self, S):
        S = list(S)
        beg = 0
        end = len(S) - 1
        rev = self.reverse(beg, end, S)
        if S == rev:
            return 1
        return 0

# { Driver Code Starts
# Initial Template for Python 3
if __name__ == '__main__':
    T = int(input())
    for i in range(T):
        S = input()
        ob = Solution()
        answer = ob.isPalindrome(S)
        print(answer)
# } Driver Code Ends

(c) Fixed program: identical to (b), except that the call in isPalindrome becomes rev = self.reverse(beg, end, S.copy()).

(d) Failing test case: for input hq, the program's output is 1; the correct output is 0.

(e) Detailed explanation: The bug in the program is that the original string S is being modified in the reverse function. When you compare the reversed string with the original string, they are always the same because they are both pointing to the same modified list. The fix is to pass a copy of the original string to the reverse function so that the original string remains unmodified.

(f) Single-sentence hint: Consider the effect of Python's pass-by-reference behavior on mutable objects like lists when you're reversing the string.

(g) Validation: the generated feedback was accepted.
Figure 1: Illustrative example showcasing GPT4Hints-GPT3.5Val for the Palindrome problem shown
in (a) from the BasicAlgo dataset. (b) shows a real-world buggy program. (c) shows a fixed program
generated by the technique in an intermediate step, and (d) shows a test case where the buggy program
fails to produce the correct output. (e) shows a detailed explanation generated by the technique that is
used later in the validation stage. (f) shows the generated feedback (a single-sentence hint). (g) highlights
that the validation stage of the technique successfully accepted the generated feedback as high-quality and
suitable for sharing with the student.
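To see the aliasing issue from panel (e) in isolation, consider the following minimal snippet (our own illustration, not part of the technique):

S = list("hq")
rev = S                  # rev and S refer to the same list object
rev.reverse()            # reversing in place mutates S as well
print(S == rev)          # True, which is why the buggy program always returns 1

S = list("hq")
rev = list(reversed(S))  # or reverse a copy, as the fixed program does with S.copy()
print(S == rev)          # False for a non-palindrome such as "hq"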
I. We showcase the utility of prompting the models with symbolic information, such as failing test cases
and fixed programs, to enhance their reasoning abilities about the underlying bugs crucial for providing
high-quality hints.
II. We showcase the utility of using LLMs in a flipped role as a “student” model to simulate the potential
effect of feedback on real human students. Our results highlight that using a weaker model (GPT-3.5,
instead of GPT-4) provides better validation of programming hints from GPT-4. This flipped role opens
up new opportunities in utilizing generative AI for in-context student modeling for automatic assessments,
learning analytics, and simulations.
III. Our technique achieves a precision of around 95% (reaching the quality of human tutors in our evaluation)
while maintaining a high coverage of over 70% across three real-world Python programming datasets.2
2 https://github.com/machine-teaching-group/lak2024_GPT4Hints-GPT3.5Val
(a) Description of the programming task: The motivation of the problem is to investigate any evidence of a link between vaccine efficacy and sex of the child. For this, you should compute the ratio of the number of children who contracted chickenpox but were vaccinated against it (at least one varicella dose) versus those who were vaccinated but did not contract chickenpox. Return results by sex. Design your function so that it returns a dictionary in the following format: {"male": 0.2, "female": 0.4}

(b) Student's buggy program   (c) Fixed program   (d) Failing test case

(e) Detailed explanation: The bug in the program is in the line where you are filtering the male children who were vaccinated. You are only considering those who received exactly one dose of the vaccine (male['P_NUMVRC'] == 1), but the problem statement asks for children who received at least one dose of the vaccine. The fix is to change the condition to male['P_NUMVRC'] >= 1.

(f) Single-sentence hint: Check the conditions you are using to filter the vaccinated children, especially the number of doses they received.

(g) Validation: the generated feedback was accepted.
Figure 2: Similar to Figure 1, this example showcases GPT4Hints-GPT3.5Val on a buggy program from
the DataAnalysis dataset.
Feedback generation for programming education. Prior to recent developments in generative AI and LLMs,
the research on feedback generation for programming education had primarily focused on fixing buggy
programs because of challenges in automatically generating natural language explanations [23, 24]. A
parallel line of research explored crowdsourcing approaches to obtain explanations provided by other stu-
dents/tutors [25]. Our work builds on recent developments in leveraging LLMs for generating programming
feedback [4, 10, 11, 26], in particular, motivated by a recent benchmark study [4] that highlighted a substantial gap
between GPT-4's hint quality and that of human tutors. Another closely related work is
[10] that proposed PyFiXV technique for generating high-precision feedback for syntax errors. PyFiXV
has a run-time feedback validation mechanism by leveraging OpenAI’s Codex-Edit model [27] at varying
temperatures as a “student” model. Inspired by [10], we also leverage an LLM-based “student” model
to perform validation. However, the validation mechanism used in PyFiXV is not directly applicable
to our setting, as it is designed only for syntax errors, which substantially simplify the validation process;
crucially, GPT4Hints-GPT3.5Val is designed to provide feedback for any type of error a student might
encounter, including errors related to the program's time complexity.
Enhancing a model's generative performance. A series of recent works have focused on enhancing the generative performance of a base model in a black-box setting, given the high monetary or computational costs
involved in fine-tuning state-of-the-art models (in fact, OpenAI's latest GPT-4 model does not have public
APIs for fine-tuning). These works operate either at the input level by developing better prompting strategies [18] or at the output level by analyzing and correcting the generated content [10, 19, 20]. Among output-level enhancements, Self-Debugging [19] and Self-Refine [20] are two recently proposed methods that enable
an LLM to analyze and correct its output automatically. Another recent work [28] introduced the concept
of Self-Repair and showed substantial performance gains when an LLM is allowed to repair its output based
on feedback from a more powerful LLM or an expert. The key intuition behind the validation mechanism
in GPT4Hints-GPT3.5Val differs from these works and is more related to [10] discussed above—we utilize
another LLM as a “student” model to simulate the potential effect of feedback on real human students.
Integration of generative AI in educational sites. There has also been increasing interest in integrating
generative AI and LLMs in educational sites. For instance, Khanmigo [29] by Khan Academy and Q-Chat
by Quizlet [30] are AI-powered systems based on OpenAI’s GPT models. These recent developments also
serve as our motivation to develop principled techniques that can generate high-quality feedback. Overall,
we see our work as complementary to these systems and believe that the proposed techniques can be useful
in further improving the performance of these systems.
2 Problem Setup
Programming task and student's buggy program as input. We start with a programming task T and a buggy
program Pb. A task T, such as shown in Figures 1a and 2a, is represented by a textual description of
the programming problem. This description also encompasses all requisite information essential
for solving the problem, such as the expected algorithmic complexity and any constraints on the input, as applicable.
In cases where the task necessitates interaction with an external file, T should also contain all pertinent
information about that file crucial for solving the problem, such as the file's format or structure. Pb, as
illustrated in Figures 1b and 2b, is an unsuccessful attempt by the student to solve T. This program fails
to pass at least one of the test cases in the test suite for T. In general, Pb may contain one or multiple
errors, spanning various error types including syntax and semantic errors.
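For concreteness, a minimal sketch (our own, not the paper's implementation) of checking a submitted program against a task's test suite of input/output pairs is shown below; the subprocess-based runner is an assumption made for illustration. A buggy program Pb is then simply one for which this check returns False.

import subprocess

def passes_test_suite(program_source: str, test_suite) -> bool:
    # test_suite is a list of (stdin_text, expected_stdout) pairs.
    for stdin_text, expected in test_suite:
        try:
            result = subprocess.run(
                ["python3", "-c", program_source],
                input=stdin_text, capture_output=True, text=True, timeout=10,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a timeout as a failing test case
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True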
Tutor-style hint as output and quality assessment. Given T and Pb , we aim to generate a human tutor-style
natural language hint H as feedback to aid the student in understanding and resolving the programming
error. We assess the quality of generated feedback along four quality attributes following the rubric used
in [4]. All attributes are binary, with a value of 1 being better. HCorrect captures whether the generated
hint provides correct information for resolving issues in the student’s buggy program. HInformative captures
whether the generated hint provides useful information to help the student resolve bug(s); this attribute is
set to 0 by default when the hint is incorrect. HConceal captures that the information in the generated hint
is not too detailed, so the student would also have to reason about implementing the fixes; this attribute
is set to 0 by default when the hint is incorrect. HComprehensible captures whether the generated hint
is easy to understand, presented in a readable format, and doesn’t contain redundant information. In our
evaluation, human experts (evaluators) assess the quality of generated hints along these four attributes. We
measure the overall quality of the generated hint by HOverall that takes the value of 1 (good quality) if all
the four quality attributes are satisfied and otherwise 0 (bad quality).
Performance metrics and objective. Next, we describe the overall performance metrics used to evaluate a
feedback generation technique. For a given student’s buggy program Pb , we seek to design techniques that
generate feedback and also decide whether the generated feedback is suitable for sharing with the student.
Similar to [10], we measure the performance of a technique using two metrics: (i) Coverage, measuring the
percentage of cases in which the generated feedback is provided to the student; (ii) Precision, measuring the
percentage of provided feedback that is of good quality w.r.t. the HOverall quality introduced
above. In our experiments, we will compute these metrics on a dataset comprising a set of students' buggy
programs. Our goal is to design feedback generation techniques with high precision, which is imperative
before deploying such techniques in classrooms. In particular, we aim to develop techniques that achieve a
precision level of human tutors while maintaining an effective trade-off between precision and coverage.
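To make the two metrics concrete, the following is a minimal sketch (our own illustration, not code from the paper) of how coverage and precision could be computed from a set of annotated feedback decisions; the record fields are hypothetical names.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FeedbackRecord:
    provided: bool             # did the technique decide to share the feedback?
    h_overall: Optional[int]   # expert rating (1 = good, 0 = bad); None if not provided

def coverage_and_precision(records: List[FeedbackRecord]):
    provided = [r for r in records if r.provided]
    coverage = 100.0 * len(provided) / len(records) if records else 0.0
    good = [r for r in provided if r.h_overall == 1]
    precision = 100.0 * len(good) / len(provided) if provided else 0.0
    return coverage, precision

# Example: 4 buggy programs; feedback shared for 3 of them, of which 2 are rated good.
records = [FeedbackRecord(True, 1), FeedbackRecord(True, 1),
           FeedbackRecord(True, 0), FeedbackRecord(False, None)]
print(coverage_and_precision(records))  # (75.0, 66.66...)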
Figure 3: Overview of the three stages of GPT4Hints-GPT3.5Val for a student working on task T with a buggy program Pb. Stage-1 generates symbolic data: a failing input/output case ω using program analysis and a fixed program Pf using GPT-4. Stage-2 generates feedback using GPT-4: a single-sentence hint H and a detailed explanation X. Stage-3 validates the feedback by simulating, with GPT-3.5, the utility of X in fixing Pb; if the feedback is rejected, the process is repeated for up to k trials.
3 Our Technique: GPT4Hints-GPT3.5Val
Figure 3 provides an overview of our technique, which operates in three stages: it leverages GPT-4 as a simulated "tutor" model for generating feedback and GPT-3.5 as a simulated "student" model for feedback validation. In Section 3.1, we
describe the two types of symbolic information that are helpful for generating feedback and how to obtain them;
in Section 3.2, we describe the process of feedback generation augmented with this symbolic information.
Subsequently, in Section 3.3, we introduce a novel validation mechanism aiming to elevate the precision of
the delivered feedback while maintaining a high level of coverage.
3.1 Stage-1: Generating Symbolic Data
Overview and intuition. As discussed in Section 1, there remains a notable performance gap between state-of-the-art generative AI models and human tutors regarding hint generation. One key factor contributing to this disparity is these models' limited ability to perform symbolic reasoning and program execution. GPT-4 lacks the capability to execute the given code and retrieve an output, which could help it gain a deeper understanding of the underlying bugs. To mitigate this gap, we employ external tools to execute programs and extract useful symbolic information. We then supply this information to GPT-4 for feedback generation. Our approach centers on leveraging two categories of symbolic data: failing test cases and fixed programs.
Input/output for a failing test case. To highlight the error in the buggy program Pb, we provide GPT-4 with a test case for which Pb fails to produce the expected output. To acquire this test case, we run Pb on the existing test suite given for the corresponding task T. The first test case on which Pb fails is selected. We denote the triplet comprising this input, the output generated by Pb, and the expected output as ω, and include it in the prompt for feedback generation.
Fixed program. The fixed program, denoted as Pf, is generated using GPT-4, employing a procedure adapted from the work in [10]. To be more specific, we initiate the process by requesting the model to produce 10 independent fixed programs. For this purpose, we include T and Pb in the prompt (using the same format as the third prompt in Figure 4) and ask for 10 outputs (each output contains a fixed program) with the hyperparameter temperature set to 0.5. Then, from this set of 10, we take the programs that pass the test suite for T and, among them, identify Pf as the one with the smallest token-edit distance w.r.t. Pb. To compute the token-edit distance between two programs, we first tokenize them using the Pygments library [31] and then calculate the Levenshtein edit distance based on the tokenized strings. If Pf is found, we include it in the prompt for feedback generation. If, however, none of the generated programs is correct, we opt to exclude this symbolic information from the prompt.
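As an illustration of the fixed-program selection step above, the following is a minimal sketch (our own, not the paper's released implementation) that tokenizes programs with the Pygments library and picks, among candidate fixes that pass the test suite, the one closest to the buggy program in token-level Levenshtein distance; the candidate list and the test-suite checker (e.g., the passes_test_suite sketch in Section 2) are assumed to be given.

from pygments.lexers import PythonLexer

def tokenize(code: str):
    # Keep only non-whitespace token values produced by the Python lexer.
    return [value for _, value in PythonLexer().get_tokens(code) if value.strip()]

def token_edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance over token sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]

def select_fixed_program(buggy, candidates, passes):
    # Among candidates passing the test suite, return the closest one (or None).
    buggy_tokens = tokenize(buggy)
    passing = [c for c in candidates if passes(c)]
    if not passing:
        return None  # exclude this symbolic information from the prompt
    return min(passing, key=lambda c: token_edit_distance(buggy_tokens, tokenize(c)))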
3.2 Stage-2: Generating Feedback
Overview and intuition. In this stage, we aim to obtain a human tutor-style hint H as feedback to be given to the student, as previously mentioned in Section 2. In addition to requesting a hint H from GPT-4, we also ask for a detailed explanation, denoted as X, of the bugs in Pb. Asking for this explanation draws inspiration from Chain-of-Thought [18], an established method for enhancing the reasoning capabilities of LLMs. The essence of the Chain-of-Thought approach lies in encouraging LLMs to explain their thought process step by step prior to presenting the final output. Within the specific context of hint generation, we allow the model to elaborate its reasoning through X before producing the concise single-sentence hint H, which is essentially an abstracted version of the explanation. Furthermore, X will also play a pivotal role in the subsequent feedback validation stage, elaborated upon in Section 3.3.
Prompt for feedback generation. In Figure 4 (first prompt), we provide our prompt for generating feedback. This prompt comprises the problem description for T, the buggy program Pb, the symbolic information extracted in the previous stage, and a request for an explanation X along with a hint H. To get a response from GPT-4, we use this prompt while setting the hyperparameter temperature to 0, indicating our preference for the most probable answer. All other hyperparameters are kept at their default settings. Following this, X and H are extracted automatically from the output.
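The following is a minimal sketch (ours, using the openai Python client; it is not the paper's implementation) of the Stage-2 call at temperature 0. The way the response is split into X and H is a simplifying assumption based on the two numbered requests, (1) and (2), in the prompt of Figure 4.

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def generate_explanation_and_hint(prompt: str, model: str = "gpt-4-0613"):
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # prefer the most probable answer
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # Assumption: the model answers the numbered requests in order, so the part
    # before "(2)" is the detailed explanation X and the rest is the hint H.
    explanation, _, hint = text.partition("(2)")
    return explanation.replace("(1)", "").strip(), hint.strip()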
3.3 Stage-3: Validating Feedback
Overview and intuition. This validation stage aims to enhance the precision of the feedback provided to the student. It is worth noting that, despite the inclusion of augmented symbolic information in the prompt, the hint generated in Stage-2 may not always align with the desired quality criteria outlined in Section 2. To mitigate this issue, we introduce a validation mechanism that adds a run-time quality assurance layer and decides whether the generated feedback is suitable for sharing with the student. The key idea behind this validation mechanism is to leverage an additional AI model as a “student” model to simulate the potential utility of providing this feedback to human students. More concretely, we seek to evaluate the quality of feedback by assessing its impact on the simulated students' ability to fix the bugs. If the simulated students find it easier to fix Pb with the help of the feedback, then the feedback is deemed high-quality and can be subsequently provided to the real student. As the “student” model, we use a weaker model, GPT-3.5, instead of GPT-4. The key intuition is that a weaker model provides a better differential effect in quantifying the utility of feedback in fixing the buggy program; moreover, we use the “student” model at a high temperature to add further stochasticity to the process of fixing the program (see Footnote 4). Furthermore, we will use the detailed explanation X (instead of the single-sentence hint H) to assess the utility of feedback for fixing the bugs. In our evaluation (Section 4.4 and Figure 7), we will demonstrate the effectiveness of these design choices.
Two prompts for validation. Figure 4 (second and third prompts) illustrates the two prompts used by the feedback validation mechanism. Both prompts essentially instruct the “student” model (GPT-3.5) to fix Pb. The primary distinction is that, in contrast to the third (standard) prompt, the second (augmented) prompt additionally incorporates the explanation X. More concretely, the third (standard) prompt is the same as the prompt used in Stage-1 when generating a fixed program; the second (augmented) prompt puts emphasis on the detailed explanation to serve as an instruction for the “student” model when fixing the program. For each prompt, we ask GPT-3.5 to generate a set of n = 10 independent outputs (the temperature is set to 0.5, similar to Stage-1), effectively utilizing GPT-3.5 in the role of 10 simulated students. We denote the number of correct output programs resulting from the standard prompt as n1, and the number of correct output programs resulting from the augmented prompt as n2. The correctness of a program is determined by its ability to pass the whole test suite for the corresponding task T. Next, we explain how we use these quantities for feedback validation.
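To make the counts concrete, here is a minimal sketch (our own illustration) of how n1 and n2 could be obtained by sampling n = 10 candidate fixes from the “student” model with and without the explanation. The openai client usage mirrors the earlier sketch, passes_test_suite refers to the checker sketched in Section 2, and extract_code is a hypothetical helper for pulling the program out of a reply.

import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def extract_code(reply: str) -> str:
    # Hypothetical helper: use a fenced code block if present, else the full reply.
    blocks = re.findall(r"```(?:python)?\n(.*?)```", reply, flags=re.DOTALL)
    return blocks[0] if blocks else reply

def count_correct_fixes(prompt: str, test_suite, n: int = 10,
                        model: str = "gpt-3.5-turbo-0613") -> int:
    # Sample n independent fixes at temperature 0.5 (10 simulated students) and
    # count how many pass the whole test suite for the task.
    response = client.chat.completions.create(
        model=model, temperature=0.5, n=n,
        messages=[{"role": "user", "content": prompt}],
    )
    programs = [extract_code(choice.message.content) for choice in response.choices]
    return sum(passes_test_suite(p, test_suite) for p in programs)

# n1 = count_correct_fixes(standard_prompt, test_suite)
# n2 = count_correct_fixes(augmented_prompt_with_explanation, test_suite)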
Validation threshold rules. Our main idea for validation is that good feedback should help students fix the buggy program more easily than they could without it. Thus, the primary rule for feedback validation is to have n2/n ≥ n1/n. Nonetheless, in situations where n1 assumes particularly low values, e.g., n1 = 0 or n1 = 1, this condition becomes less stringent, and any feedback, regardless of its quality, may pass the validation. To address this, we incorporate an additional requirement to ensure that n2/n attains a sufficient level independently. This is achieved through the inclusion of the following condition: (n2/n ≥ α) ∨ (n2/n ≥ n1/n + β), where we instantiate α as 0.50 and β as 0.25. In other words, we require the ratio of correct output programs generated with the help of the explanation to either exceed a certain fixed threshold (i.e., n2/n ≥ 0.50) or be substantially higher than the ratio of correct output programs generated without the explanation (i.e., n2/n ≥ n1/n + 0.25), or both. Consequently, our final validation mechanism approves a feedback instance only when the following condition holds true: (n2/n ≥ n1/n) ∧ ((n2/n ≥ 0.50) ∨ (n2/n ≥ n1/n + 0.25)), and rejects it otherwise. In our experiments (Section 4), we will also compare the performance of different variants of threshold rules.
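The threshold rule itself reduces to a small pure function; the following sketch (ours) uses the instantiation α = 0.50 and β = 0.25 described above.

def accept_feedback(n1: int, n2: int, n: int = 10,
                    alpha: float = 0.50, beta: float = 0.25) -> bool:
    # Accept only if the simulated students do at least as well with the
    # explanation (n2/n >= n1/n) and the augmented success ratio is either
    # high on its own (n2/n >= alpha) or clearly above the baseline
    # (n2/n >= n1/n + beta).
    r1, r2 = n1 / n, n2 / n
    return (r2 >= r1) and (r2 >= alpha or r2 >= r1 + beta)

# Examples reported in Section 4.4: accept_feedback(2, 6) is True (Figure 1),
# while accept_feedback(8, 0) is False (first trial of Figure 10).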
Multiple trials. When the validation mechanism rejects a feedback instance, it is not provided to the human student. While this is expected to boost the precision metric, it could also lead to a significant drop in the coverage metric [10]. Given the stochasticity of the generation and validation processes, we introduce an additional layer to the overall process to boost the coverage while ensuring high precision.
4 We refer the reader to recent results in [4, 32] on the performance of different GPT-based models across various programming education scenarios.
Prompt to Generate Feedback
I’m working on a Python programming problem. The current program below is not working well. Can you help by
giving a hint?
Problem description:
{problem_description}
Buggy program:
{buggy_program}
(1) Can you describe the bug(s) in this program and the required fixes?
(2) Can you provide a concise single-sentence hint about one bug in this program? The hint should not be too detailed
as I want to think about the fixes by myself. However, the hint should not be too abstract, as I need some help.
Prompt to Validate Feedback (augmented with the explanation)
Problem description:
{problem_description}
Buggy program:
{buggy_program}
Explanation of the bug(s):
{explanation}
If anything in the explanation above is incorrect or too confusing, please say “Explanation is bad.” and stop.
If all the reasoning in the explanation above is correct and easy to understand, then please fix the buggy program
according to the explanation above. In this case, note that the explanation above may not cover all bugs (if there are
multiple bugs) in the buggy program, so you need to think to resolve the remaining bugs by yourself.
Prompt to Validate Feedback (standard)
Problem description:
{problem_description}
Buggy program:
{buggy_program}
Can you fix the above buggy program? Make sure that you make minimal possible changes needed to fix the
program.
Figure 4: Prompts employed by GPT4Hints-GPT3.5Val for feedback generation (first) and feedback
validation (second and third).
BasicAlgo: 5 programming tasks, 25 buggy programs, 10.7 average lines of student code; task's objective: write an algorithm in Python; domain and concepts: Python syntax, basic algorithms.
DataRegex: 1 programming task, 24 buggy programs, 2.2 average lines of student code; task's objective: fix a regular expression in Python; domain and concepts: regular expressions, information extraction.
DataAnalysis: 1 programming task, 30 buggy programs, 12.1 average lines of student code; task's objective: perform data analysis in Python; domain and concepts: pandas library, data analysis.
Figure 5: Overview of the datasets used in this work. See Section 4.1 for details.
More concretely, if a feedback instance is rejected, we restart the process, including acquiring symbolic information, generating hints, and the subsequent validation. We maintain this iterative cycle until either a generated feedback instance is approved by the validation mechanism or a predefined maximum number of iterations, denoted as k, is reached (we set k = 3). After k trials, if none of the feedback instances passes validation, we terminate this outer loop and do not provide any feedback to the human student. When deploying our technique in real-world classroom settings, a human tutor could step in and take over the work of providing feedback to students for whom no automatic feedback is provided.
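Putting the stages together, the outer loop with k = 3 trials can be sketched as follows (our own illustration; the stage functions and the acceptance rule, e.g., the accept_feedback sketch above, are passed in as callables rather than being the paper's actual implementation).

from typing import Callable, Optional, Tuple

def run_pipeline(
    generate_feedback_once: Callable[[], Tuple[str, str]],  # Stage-1 + Stage-2: returns (explanation X, hint H)
    simulate_students: Callable[[Optional[str]], int],      # Stage-3: number of correct fixes, given X or None
    accept: Callable[[int, int], bool],                     # threshold rule, e.g., accept_feedback
    k: int = 3,
) -> Optional[str]:
    for _ in range(k):
        explanation, hint = generate_feedback_once()
        n1 = simulate_students(None)           # standard prompt, without the explanation
        n2 = simulate_students(explanation)    # augmented prompt, with the explanation
        if accept(n1, n2):
            return hint                        # approved: share the hint with the student
    return None                                # rejected in all k trials: defer to a human tutor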
4 Experimental Evaluation
In this section, we evaluate our technique, GPT4Hints-GPT3.5Val, across three datasets spanning differ-
ent domains of introductory Python programming. We assess GPT4Hints-GPT3.5Val in comparison to
baselines such as GPT-4 and human tutors. Furthermore, we compare our validation mechanism with several alternative variants. In our experiments, we use OpenAI's GPT-4 (model=gpt-4-0613) as the “tutor” model and
ChatGPT based on GPT-3.5 (model=gpt-3.5-turbo-0613) as the “student” model unless otherwise stated.
4.1 Datasets
To comprehensively assess the techniques’ performance across diverse domains within introductory program-
ming education, we use three datasets representing different types of learning objectives, as summarized in
Figure 5. All datasets consist of students’ Python buggy programs. Below, we provide a detailed description
of each of these datasets.
The first dataset, BasicAlgo, was introduced in [4]. It covers five popular introductory Python problems,
and for each problem, there are five corresponding buggy programs. The problems capture a diverse set of
basic programming concepts and include the following: GCD (finding the greatest common divisor of two
given numbers), Fibonacci (generating the list of Fibonacci numbers up to a given value), DivisorsDiv3
(counting the number of divisors of a given number that are divisible by 3), Palindrome (checking whether a
given string is a palindrome or not), and MergeStrs (merging two given strings alternately). The buggy
programs come from different users on the geeksforgeeks.org platform [33] and capture a variety of bug
types and code lengths. Figures 1 and 10 show two examples of buggy programs, with bugs related to a
misconception about the mutability of lists and a mistake in the order of merging the strings, respectively.
The second dataset, DataRegex, comes from an introductory data science programming course. This course
is a part of an online Master’s degree program in applied data science; students enrolling in the course are
required to have basic Python programming and statistics knowledge. We examine the second exercise from
the first assignment of the course, which requires students to use regular expressions to extract information
from a text file. In particular, the text file contains people’s names and their corresponding grades; the
students need to fix a given buggy function so that it correctly reads the file, matches a regular expression,
captures and returns a list of people who got a grade of B.5 To solve the problem, students need knowledge
of basic regular expression concepts such as wildcard characters, grouping, look around, and quantification.
This dataset contains 24 buggy submissions, each from a unique student. For each student, if there are
multiple buggy submissions, we include only the median submission w.r.t. submission time in the
5 For GPT-4, instead of giving it the file, we describe the file format in the prompt; the description is provided as part of our implementation (see Footnote 2).
dataset. Some common types of bugs are mishandling of grouping (Figure 9), returning names of all people,
and returning only people’s last names. It is worth noting that there is only one test case in the test suite
for this problem; this is in contrast to algorithmic problems, such as the ones in BasicAlgo, in which the
test suites usually comprise a large number of input/output cases.
The third dataset, DataAnalysis, is from the second exercise of the second assignment in the same data
science course. By that time, the students learnt to use data manipulation libraries such as pandas to load,
filter, and extract meaningful information from data-frames. For this problem, the students are given a csv
format file that contains a data-frame, a 252-page data guide PDF,6 a problem description, and a function
signature. The students need to complete the given empty function to compute the ratios of vaccinated
children who contracted chickenpox versus those who were vaccinated but did not contract chickenpox,
separated by sex. To solve this problem, besides the basic Python syntax, the students also need to know
how to select and use relevant libraries (such as pandas), understand and search for relevant information
from the extensive data guide, and deal with missing data. To form this third dataset, we sample 30 buggy
programs using the same procedure as used for the second dataset. Some bugs in the dataset are: mis-filtering
of data (Figure 2), misreading the requirements and computing a wrong ratio, and forgetting to handle,
or wrongly handling, missing values.
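As a rough illustration of what a correct solution involves, here is a sketch in pandas (our own; apart from P_NUMVRC, which appears in Figure 2, the column names SEX and HAD_CPOX and their value encodings are hypothetical placeholders rather than the actual dataset's schema).

import pandas as pd

def chickenpox_by_sex(df: pd.DataFrame) -> dict:
    # Drop rows with missing values in the relevant (partly hypothetical) columns.
    df = df.dropna(subset=["SEX", "HAD_CPOX", "P_NUMVRC"])
    ratios = {}
    for label, sex_code in [("male", 1), ("female", 2)]:  # hypothetical encoding
        vaccinated = df[(df["SEX"] == sex_code) & (df["P_NUMVRC"] >= 1)]
        contracted = (vaccinated["HAD_CPOX"] == 1).sum()      # hypothetical encoding
        not_contracted = (vaccinated["HAD_CPOX"] == 2).sum()  # hypothetical encoding
        ratios[label] = contracted / not_contracted
    return ratios  # e.g., {"male": 0.2, "female": 0.4} as in the task description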
4.2 Techniques Evaluated
Baseline GPT-4 and human tutors. As our first baseline, we employ GPT-4 in a straightforward manner
by presenting it with the task description and the buggy program in the prompt to generate feedback. The
format of the prompt closely resembles that depicted in Figure 4 (first prompt), albeit without the inclusion
of additional symbolic information. The second baseline employs human tutors with experience in Python
programming and tutoring; it serves as the gold standard that our technique aims to match. In our experiments,
two human tutors are employed to give hints independently. From here on, we refer to these baselines as
GPT4Hints-Base and TutorHints, respectively.
Variants of our technique without validation. As mentioned previously, we introduce two additional types of
symbolic information into our prompt for feedback generation. These additions consist of a failing test case
and a fixed program, given that a correct fixed program can be produced (see Section 3.1). Accordingly,
we have formulated two variant techniques: (i) GPT4Hints-IO involves enhancing GPT4Hints-Base by
incorporating the failing test case into the prompt; (ii) GPT4Hints-IOFix integrates both of these types
of symbolic information into the prompt. Note that neither of these techniques employs any validation, i.e.,
the generated feedback is always deemed suitable for sharing.
Variations of validation stage in our technique. Next, we will consider variants of GPT4Hints-GPT3.5Val
in terms of the validation stage. First, we look at the role of multiple trials when a feedback instance fails
validation. We compare our technique with a variant where there is only a single trial (i.e., k = 1). Second,
we examine the performance when GPT-4 is used as the simulated “student” model instead of GPT-3.5.
Third, we investigate the case wherein the generated single-sentence hint, instead of the detailed explanation,
is utilized in the validation process. Fourth and last, we vary the threshold rule used for validation. In this
regard, there are three variations: (i) n2/n ≥ α, where n1 is not considered in the rule; (ii) (n2/n ≥ n1/n) ∧ (n2/n ≥ α),
where β is not considered in the rule; and (iii) n2/n ≥ n1/n, where α and β are not considered in the rule.
4.3 Evaluation Procedure
As discussed in Section 2, we employ human experts (evaluators) to assess the quality of generated feedback.
More concretely, two human evaluators independently rated the feedback generated by the techniques along
the quality attributes introduced in Section 2 (see Footnote 7).
6 The data guide is meant to exercise students on extracting relevant information. Typically, students would search the PDF
using keywords such as ’chickenpox’ to spot relevant columns needed. For GPT-4, we extract and provide in the prompt a short
summary describing the relevant columns; the summary is provided as part of our implementation (see Footnote 2).
7 Similar to [4], these two human evaluators are the same as the two human tutors employed in the TutorHints technique. When
evaluating the TutorHints technique, an evaluator does not assess the feedback they themselves produced while acting as a tutor.
Technique BasicAlgo DataRegex DataAnalysis
Precision Coverage Precision Coverage Precision Coverage
GPT4Hints-Base 66.0 (2.0) 100.0 85.4 (2.1) 100.0 78.3 (5.0) 100.0
GPT4Hints-IO 72.0 (4.0) 100.0 85.4 (2.1) 100.0 85.0 (5.0) 100.0
GPT4Hints-IOFix 82.0 (2.0) 100.0 91.7 (4.2) 100.0 93.3 (3.3) 100.0
TutorHints 92.0 (4.0) 100.0 91.7 (4.2) 100.0 91.7 (8.3) 100.0
GPT4Hints-GPT3.5Val 94.7 (0.0) 76.0 (0.0) 97.6 (2.4) 87.5 (0.0) 95.5 (4.5) 73.3 (0.0)
Figure 6: Results for different techniques on three real-world Python programming datasets. For each
technique and dataset, results are averaged across two evaluators and reported as mean (stderr) as per the
evaluation procedure in Section 4.3. Our technique, GPT4Hints-GPT3.5Val, performs validation of the
generated feedback to achieve a higher quality of the feedback in terms of precision level, thereby trading off
precision and coverage. Our technique can achieve a precision of around 95% reaching the quality of human
tutors while maintaining a high coverage of over 70% across three real-world datasets; see Section 4.4 for a
detailed discussion of results.
Variant of the validation stage, reported as precision (stderr) and coverage for BasicAlgo, DataRegex, and DataAnalysis:
Single trial k = 1 instead of k = 3: 91.7 (0.0), 48.0 | 96.4 (3.6), 58.3 | 94.4 (5.6), 60.0
GPT-4 student model instead of GPT-3.5 student model: 84.8 (2.2), 92.0 | 93.5 (2.2), 95.8 | 93.1 (3.4), 96.7
Using single-sentence hint H instead of detailed explanation X: 89.3 (3.6), 56.0 | 93.5 (2.2), 95.8 | 95.0 (5.0), 66.7
Threshold rule without considering n1, i.e., n2/n ≥ 0.50: 86.8 (2.6), 76.0 | 91.7 (4.1), 100.0 | 95.5 (4.5), 73.3
Simplified threshold rule without β, i.e., (n2/n ≥ n1/n) ∧ (n2/n ≥ 0.50): 94.1 (0.0), 68.0 | 97.6 (2.4), 87.5 | 95.5 (4.5), 73.3
Simplified threshold rule without α, β, i.e., n2/n ≥ n1/n: 95.2 (0.0), 84.0 | 97.6 (2.4), 87.5 | 92.3 (3.8), 86.7
Figure 7: Comparison of performance between GPT4Hints-GPT3.5Val and different variants w.r.t. the
validation stage. The first four variations (single trial, GPT-4 student model, using H, and threshold without
considering n1) show how different design choices in our validation stage help improve the precision-coverage
trade-off. The last two variations with simplified threshold rules show the robustness of the default threshold
rule in terms of α and β. See Sections 3.3 and 4.4 for further details.
Given the ratings from each evaluator, we compute precision and coverage (based on the overall feedback quality HOverall; see Footnote 8). Finally, for each technique and
dataset, we aggregate across evaluators and report averaged results as mean (stderr). We obtained a Cohen's
kappa reliability value of 0.65, indicating substantial agreement between the evaluators [34]. Next, we elaborate on
our experimental results.
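As a side note on the agreement statistic, Cohen's kappa over the two evaluators' binary HOverall ratings can be computed with scikit-learn as in the sketch below (ours; the rating lists are made-up placeholders, not the study's data).

from sklearn.metrics import cohen_kappa_score

# Hypothetical binary HOverall ratings (1 = good quality) from the two evaluators
# over the same set of generated hints.
evaluator_1 = [1, 1, 0, 1, 0, 1, 1, 0]
evaluator_2 = [1, 1, 0, 1, 1, 1, 0, 0]
print(cohen_kappa_score(evaluator_1, evaluator_2))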
4.4 Results
Comparison with baselines and human tutors. Figure 6 provides an overview of results, comparing our tech-
nique and baselines. It is evident that GPT4Hints-Base exhibits a substantial performance gap when
compared to TutorHints. This gap is partially mitigated with the incorporation of failing test cases and
fixed programs in the prompt, as seen with GPT4Hints-IO and GPT4Hints-IOFix, respectively.9 Our fi-
nal technique, GPT4Hints-GPT3.5Val, consistently achieves precision levels comparable to TutorHints,
8 In addition, we also asked the evaluators to rate ECorrect, a binary attribute capturing the correctness of the detailed
explanation X. Further analysis regarding this additional attribute will be discussed in Section 4.4 and Figure 8.
9 If, for a buggy program, no correct fixed program is obtained (see Section 3.1), the prompt of GPT4Hints-IOFix is the same
as GPT4Hints-IO’s. The rates at which we obtained at least one correct fix for BasicAlgo, DataRegex, and DataAnalysis
datasets are 92%, 100%, and 93%, respectively.
Method | Hint: HOverall, HCorrect, HInformative, HConceal, HComprehensible | Explanation: ECorrect | (Hint, Explanation): HOverall, ECorrect
GPT4Hints-Base 66.0 68.0 66.0 68.0 100.0 58.0 56.0
GPT4Hints-IO 72.0 78.0 74.0 76.0 98.0 66.0 62.0
GPT4Hints-IOFix 82.0 84.0 82.0 84.0 100.0 82.0 80.0
GPT4Hints-GPT3.5Val 94.7 94.7 94.7 94.7 100.0 91.1 92.1
Figure 8: Fine-grained results w.r.t. evaluation rubric that assesses the quality of generated feedback
across different attributes as discussed in Sections 2 and 4.3. For our technique, these fine-grained results
demonstrate a high correlation between generating a high-quality hint and a correct detailed explanation
(used in the validation stage).
around 95% across all datasets.10 Importantly, the trade-off in coverage required to attain such high precision
is effective, and our technique maintains a coverage rate exceeding 70% for all three datasets. In Figure 8, we
provide fine-grained results across different attributes, demonstrating a high correlation between generating
a high-quality hint and a correct detailed explanation – this further justifies why the explanation can be
used to validate the hint.
Comparison with variations of validation stage. Figure 7 shows the performance of different variants in
comparison to our technique. Notably, with a single trial (i.e., k = 1), there is a substantial decrease
in coverage across all datasets. This result underscores the marked effect of incorporating multiple trials
in maintaining a high coverage level. Intriguingly, when we substitute GPT-3.5 with the more advanced
model, GPT-4, as the simulated “student” model, there is actually a reduction in precision. We observed that
GPT-4 is worse than GPT-3.5 in terms of achieved precision as it tends to correctly fix the buggy program
even if the explanation in the validation prompt is wrong. These results highlight that a weaker model (here,
GPT-3.5 instead of GPT-4) could be better suited as a simulated “student” model. Using hints
instead of explanations for validation yields inferior performance in general, as the explanation contains
more details about the bugs and fixes (thus having a better differential effect between the standard
and the augmented prompt). Regarding variants of the validation rule, the overall performance remains
relatively stable when α and β are excluded from the rule, suggesting a robust performance irrespective of
the specific settings of these hyperparameters. However, a noticeable decline in performance is observed when
the relative condition (n2/n ≥ n1/n) is omitted, highlighting its importance in the validation process.
Qualitative analysis. We have included a few illustrative examples to showcase the effectiveness of our tech-
nique. Figures 1, 2, and 9 exemplify cases where GPT4Hints-GPT3.5Val generated high-quality feedback
during Stage-2 and then successfully accepted during Stage-3. Conversely, for the scenario in Figure 10,
GPT4Hints-GPT3.5Val’s Stage-2 failed to produce high-quality feedback in all three trials, but Stage-3
successfully rejected all of those low-quality feedback instances. To be more specific, the values of n1 and n2
for the three trials in this case were {n1 = 8, n2 = 0}, {n1 = 6, n2 = 0}, and {n1 = 5, n2 = 0}, respectively.
In contrast, in the example shown in Figure 1, GPT4Hints-GPT3.5Val’s Stage-2 generated high-quality
feedback during the first trial and Stage-3 subsequently accepted it with values {n1 = 2, n2 = 6}. We have
provided additional illustrative examples as part of our implementation (see Footnote 2).
5 Concluding Discussions
We investigated the role of generative AI and large language models in providing human tutor-style program-
ming hints to help students resolve errors in their buggy programs. In particular, we focused on improving the
quality of generated feedback, which is crucial for deployment in real-life classroom settings. We developed a
novel technique, GPT4Hints-GPT3.5Val, that leverages GPT-4 as a “tutor” model to generate hints and
GPT-3.5 as a “student” model to validate the hint quality. This validation step provides a layer of quality
assurance by trading off coverage (how many students are given automatic feedback) and precision (quality
10 When comparing GPT4Hints-GPT3.5Val with other techniques in Figure 6, the results are significantly different w.r.t. χ2
tests [35] (p ≤ 0.0001); here, we use contingency tables with two rows (techniques) and four columns (data points are mapped
to four possible precision/coverage outcomes).
(a) Description of the programming task: You are given a data file at 'assets/grades.txt' that contains multiple entries representing individuals and their respective grades. Fix the incorrect regex between ### FIX CODE BELOW and ### FIX CODE ABOVE to return a list of just those individuals who received a grade of 'B'. The first few lines of the data file are illustrated below:
Jane Doe: A
John Doe: B
...

(b) Student's buggy program:

def student_grades():
    import re
    with open("assets/grades.txt", "r") as f:
        grades = f.read()

    ### FIX CODE BELOW
    re_pattern = "(\w+) (\w+)(?=: B)"
    matches = re.findall(re_pattern, grades)
    ### FIX CODE ABOVE

    return matches

(c) Fixed program: identical to (b), except that the pattern becomes re_pattern = "(\w+ \w+)(?=: B)".

(d) Failing test case: the program's output is [('John', 'Doe'), ...]; the correct output is ['John Doe', ...].

(e) Detailed explanation: The bug in the program is in the regular expression pattern. The pattern '(\w+) (\w+)(?=: B)' is capturing two separate word groups, which results in a list of tuples. The fix is to capture the full name as one group by using '(\w+ \w+)(?=: B)'. This will return a list of full names as strings.

(f) Single-sentence hint: Consider how the regular expression groups are defined and how they affect the output format.

(g) Validation: the generated feedback was accepted.
Figure 9: Similar to Figure 1, this example showcases GPT4Hints-GPT3.5Val on a buggy program from
the DataRegex dataset.
of the given feedback). We performed an extensive evaluation to showcase the efficacy of our technique on
three real-world Python programming datasets, reaching the precision-level of human tutors.
Our work has two important implications for the research community interested in leveraging generative AI
and large language models for computing and programming education. First, our results show how we can
effectively utilize these models as a “tutor” by prompting them with symbolic data such as failing test cases.
This symbolic data essentially provides in-context information to enhance the reasoning and execution abili-
ties of these models where they typically struggle. Second, our results show how we can utilize these models
in a flipped role as a “student” to simulate the effect of feedback on a real human student. Interestingly, we also
showed that a weaker model (GPT-3.5, instead of GPT-4) serves as a better “student” model for validating
the effect of feedback generated by GPT-4. This flipped role opens up new opportunities in utilizing gen-
erative models as in-context student models for automatic assessments, learning analytics, and simulations.
Next, we discuss some limitations of our current work and ideas to tackle them in the future. First, our
work involved OpenAI’s GPT family of models; it would be useful to evaluate alternate generative models,
in particular, open-source variants like Llama-2. Moreover, we utilized the GPT-3.5 model at a higher
temperature to simulate the potential utility of providing feedback; it would be interesting to investigate
how to employ different LLMs to better simulate diverse student behaviors. Second, our work didn’t leverage
historical data on a given problem when generating hints, e.g., hints provided by human tutors for previous
students’ buggy attempts on a problem. It would be important to develop techniques that can leverage this
data, e.g., by fine-tuning these open-source variants to generate better-quality hints. Third, our evaluation
considered small datasets comprising a total of 79 buggy programs; it would be useful to scale up the
studies by considering larger-scale datasets. Fourth, we focused only on Python programming education; it
would be interesting to conduct a similar study for other programming languages and other domains beyond
programming. Fifth, our evaluation only considered expert-based annotations and didn’t involve students;
it would be important to conduct studies with students to evaluate techniques from their perspectives.
(a) Description of the programming task: Given two strings S1 and S2 as input, the task is to merge them alternatively, i.e., the first character of S1 then the first character of S2 and so on till the strings end. Expected Time Complexity: O(|S1| + |S2|). Constraints: 1 ≤ |S1|, |S2| ≤ 10^3.

(b) Student's buggy program:

# User function Template for python3
class Solution:
    def merge(self, S1, S2):
        l1 = len(S1)
        l2 = len(S2)
        if l1 > l2:
            res = ""
            dif = l1 - l2
            x = S1[:l2]
            y = S2
            d = [(i, j) for i, j in zip(x, y)]
            for i in d:
                for j in i:
                    res += j
            return res + S1[-dif:]
        elif l1 < l2:
            res = ""
            dif = l2 - l1
            x = S2[:l1]
            y = S1
            d = [(i, j) for i, j in zip(x, y)]
            for i in d:
                for j in i:
                    res += j
            return res + S2[-dif:]
        else:
            res = ""
            x = S1[:l2]
            y = S2
            d = [(i, j) for i, j in zip(x, y)]
            for i in d:
                for j in i:
                    res += j
            return res

# { Driver Code Starts
# Initial Template for Python 3
if __name__ == '__main__':
    t = int(input())
    for _ in range(t):
        S1, S2 = map(str, input().strip().split())
        ob = Solution()
        print(ob.merge(S1, S2))
# } Driver Code Ends

(c) Fixed program: identical to (b), except that in the elif branch the zipped pairs are built with x drawn from S1 and y from S2 (so that characters of S1 come first), the leftover characters are appended as S1[l2:] and S2[l1:] instead of the negative slices S1[-dif:] and S2[-dif:], and the else branch uses x = S1 directly.

(d) Failing test case: for input Qh eyNFX, the program's output is eQyhNFX; the correct output is QehyNFX.

(e) Detailed explanation: The bug in the program is in the slicing of the strings when one string is longer than the other. In the original program, when S1 is longer than S2, the remaining characters of S1 are not correctly added to the result. The same issue occurs when S2 is longer than S1. The fix is to slice the longer string from the length of the shorter string to the end, instead of slicing from the end.

(f) Single-sentence hint: Consider how you are slicing the longer string when one string is longer than the other.

(g) Validation: the generated feedback was rejected.
Figure 10: Similar to Figure 1, this example showcases GPT4Hints-GPT3.5Val on a buggy program for the
MergeStrs problem from the BasicAlgo dataset. For this example, the generated detailed explanation
and single-sentence hint feedback are not correct (e.g., the explanation suggests fixing the program based
on a different slicing strategy, which is not related to the bug in this program). The validation stage of the
technique (that evaluates the potential utility of this detailed explanation, cf. Figure 3) successfully rejected
the generated hint as low-quality and not suitable for sharing with the student. See Section 4.4 for further
discussion of results.
Acknowledgments. Funded/Co-funded by the European Union (ERC, TOPS, 101039090). Views and
opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European
Union or the European Research Council. Neither the European Union nor the granting authority can be
held responsible for them.
References
[1] OpenAI. GPT-4 Technical Report. CoRR, abs/2303.08774, 2023.
[2] OpenAI. ChatGPT. https://openai.com/blog/chatgpt, 2023.
[3] Sébastien Bubeck et al. Sparks of Artificial General Intelligence: Early Experiments with GPT-4. CoRR,
abs/2303.12712, 2023.
[4] Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. Generative AI for Programming Education: Benchmarking
ChatGPT, GPT-4, and Human Tutors. In ICER V.2, 2023.
[5] Adish Singla. Evaluating ChatGPT and GPT-4 for Visual Programming. In ICER V.2, 2023.
[6] Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. Automatic Generation of Programming
Exercises and Code Explanations Using Large Language Models. In ICER, 2022.
[7] Victor-Alexandru Pădurean, Georgios Tzannetos, and Adish Singla. Neural Task Synthesis for Visual
Programming. CoRR, abs/2305.18342, 2023.
[8] Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein,
and Juho Leinonen. Experiences from Using Code Explanations Generated by Large Language Models
in a Web Software Development E-Book. In SIGCSE, 2023.
[9] Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. Repairing Bugs in Python Assignments Using Large Language Models. CoRR, abs/2209.14876,
2022.
[10] Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and
Gustavo Soares. Generating High-Precision Feedback for Programming Syntax Errors using Large
Language Models. In EDM, 2023.
[11] Juho Leinonen, Arto Hellas, Sami Sarsa, Brent N. Reeves, Paul Denny, James Prather, and Brett A.
Becker. Using Large Language Models to Enhance Programming Error Messages. In SIGCSE, 2023.
[12] GitHub. GitHub Copilot: Your AI Pair Programmer. https://github.com/features/copilot, 2022.
[13] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. Reading Between the Lines:
Modeling User Behavior and Costs in AI-Assisted Programming. CoRR, abs/2210.14306, 2022.
[14] Samim Mirhosseini, Austin Z. Henley, and Chris Parnin. What is Your Biggest Pain Point? An
Investigation of CS Instructor Obstacles, Workarounds, and Desires. In SIGCSE, 2023.
[15] Yejin Bang et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. CoRR, abs/2302.04023, 2023.
[16] Natalie Kiesler, Dominic Lohr, and Hieke Keuning. Exploring the Potential of Large Language Models
to Generate Formative Programming Feedback. In FIE, 2023.
[17] Tiffany Wenting Li, Silas Hsu, Max Fowler, Zhilin Zhang, Craig B. Zilles, and Karrie Karahalios. Am
I Wrong, or Is the Autograder Wrong? Effects of AI Grading Mistakes on Learning. In ICER, 2023.
[18] Jason Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS,
2022.
[19] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching Large Language Models to
Self-Debug. CoRR, abs/2304.05128, 2023.
[20] Aman Madaan et al. Self-Refine: Iterative Refinement with Self-Feedback. CoRR, abs/2303.17651,
2023.
[21] Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR, abs/2307.09288,
2023.
[22] Wes McKinney et al. pandas: A Foundational Python Library for Data Analysis and Statistics. Python
for High Performance and Scientific Computing, 14(9):1–9, 2011.
[23] Rishabh Singh, Sumit Gulwani, and Armando Solar-Lezama. Automated Feedback Generation for
Introductory Programming Assignments. In PLDI, 2013.
[24] Sumit Gulwani, Ivan Radicek, and Florian Zuleger. Automated Clustering and Program Repair for
Introductory Programming Assignments. In PLDI, 2018.
[25] Andrew Head, Elena L. Glassman, Gustavo Soares, Ryo Suzuki, Lucas Figueredo, Loris D’Antoni, and
Björn Hartmann. Writing Reusable Code Feedback at Scale with Mixed-Initiative Program Synthesis.
In Learning @ Scale, 2017.
[26] Maciej Pankiewicz and Ryan Shaun Baker. Large Language Models (GPT) for Automating Feedback
on Programming Assignments. CoRR, abs/2307.00150, 2023.
[27] OpenAI. Codex-Edit. https://beta.openai.com/playground?mode=edit&model=code-davinci-edit-001, 2022.
[28] Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama.
Demystifying GPT Self-Repair for Code Generation. CoRR, abs/2306.09896, 2023.
[29] Khan Academy. Khanmigo. https://www.khanacademy.org/khan-labs, 2023.
[30] Quizlet. Q-chat. https://quizlet.com/qchat-personal-ai-tutor, 2023.
[31] Georg Brandl, Matthäus Chajdas, and Jean Abou-Samra. Pygments. https://pygments.org/, 2006.
[32] Jaromír Savelka, Arav Agarwal, Christopher Bogart, Yifan Song, and Majd Sakr. Can Generative
Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses? In
ITiCSE, 2023.
[33] geeksforgeeks.org. GeeksforGeeks: A Computer Science Portal for Geeks. https://www.geeksforgeeks.org/, 2009.
[34] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and psychological measurement,
20(1):37–46, 1960.
[35] William G Cochran. The χ2 Test of Goodness of Fit. The Annals of Mathematical Statistics, 1952.