ChatGPT in the Classroom - Research Paper
M. Popovici
POLITEHNICA University of Bucharest, Computer Science Department, Splaiul
Independentei 313, Bucharest, Romania.
ARTICLE HISTORY
Compiled February 1, 2024
ABSTRACT
In November 2022, OpenAI introduced ChatGPT – a chatbot based on supervised
and reinforcement learning. Not only can it answer questions emulating
human-like responses, but it can also generate code from scratch or complete cod-
ing templates provided by the user. ChatGPT can generate unique responses which
render any traditional anti-plagiarism tool useless. Its release has ignited a heated
debate about its usage in academia, especially by students. We have found, to our
surprise, that our students at POLITEHNICA University of Bucharest (UPB) have
been using generative AI tools (ChatGPT and its predecessors) for solving home-
work, for at least 6 months. We therefore set out to explore the capabilities of
ChatGPT and assess its value for educational purposes. We used ChatGPT to solve
all our coding assignments for the semester from our UPB Functional Programming
course. We discovered that, although ChatGPT provides correct answers in 68% of
the cases, only around half of those are legible solutions which can benefit students
in some form. On the other hand, ChatGPT has a very good ability to perform code
review on student programming homework. Based on these findings, we discuss the
pros and cons of ChatGPT in a teaching environment, as well as means for integrat-
ing GPT models for generating code reviews, in order to improve the code-writing
skills of students.
KEYWORDS
Learning and generative AI; Empirical Studies of Programming and Software
Engineering; Human Language Technologies in program development
1. Introduction
ChatGPT has experienced a surge in popularity in recent times. There is a great deal of
lively debate surrounding its potential uses and its impact on research: Stokel-Walker (2023),
science: Frieder (2023), and education: Elsen-Rooney (2023); Jalil (2023); Lambeets
(2023); Lau (2023). Since its release in November 2022, academia has
been testing its limits, with surprising results regarding its apparent proficiency in addressing a
wide range of questions. ChatGPT is the most recent addition to a series of genera-
tive pre-trained transformers (GPTs) that utilize language AI models trained on vast
amounts of human text. Its purpose is to generate discourses that closely resemble
sible for the average student. We were also interested to see to what extent ChatGPT
is able to correct itself when errors within its answers are highlighted. We discovered
that ChatGPT could improve its score from 7 to 8.6 once a follow-up question was asked
or an issue was highlighted. However, prompting such issues requires a certain level of
expertise from the average student. To this end, we also used ChatGPT to generate tests for
each of its solutions. We have found that only 70% of generated tests are correct, and
not even follow-up questions can improve this result. Our lab assignments do not provide
tests to students. Thus, when a test fails, students may find it difficult to
determine whether the issue lies with the solution or with the test supplied by ChatGPT.
The results for our evaluation study as well as an in-depth comparison between human
and code-generated errors are detailed in Section 3.
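To make this ambiguity concrete, consider a hypothetical incorrect generated test for a function sum that adds the elements of a list (our own illustration, not taken from ChatGPT's output):

def sum(l: List[Int]): Int = l.foldLeft(0)(_ + _)

// A correct test for the function above:
assert(sum(List(1, 2, 3)) == 6)
// An incorrect generated test: the expected value is wrong, so even a
// correct implementation of sum fails it.
assert(sum(List(1, 2, 3)) == 7)

A student whose correct solution fails the second assertion may wrongly conclude that the solution itself is at fault.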
Lastly, during our assessment, we noticed that ChatGPT can accurately respond
to qualitative questions concerning the code we submitted. Examples of such ques-
tions are: “is the code functional?”, “does the code use side-effects?”, “is the given
function tail-recursive?”. This observation prompted us to develop a tool which relies
on ChatGPT for code-reviewing submitted homework. Unlike testing, which ensures
the correctness of a submitted solution, the objective of code review is to assess and
possibly improve the legibility and structure of the code. In our lecture, it is par-
ticularly important to review whether the code was written in the functional style. Our
team usually goes over more than 60 submissions for each of the 4 homework assignments
that we give in the Functional Programming lecture. Although code-review is highly ben-
eficial for students, conducting it comprehensively can be extremely time-consuming
and resource-intensive for our team. Our tool relying on ChatGPT can automate most
of this process, with high accuracy. Code reviewers can select dedicated parts of the
homework under review (e.g. functions with specific names), then formulate questions
such as the ones previously illustrated. Our tool will parse the desired code fragment
from each submitted homework and combine it with the question we are targeting.
Next, it will perform a sequence of ChatGPT queries, one for each homework, and re-
trieve each answer. These can subsequently be subject to human review before being
submitted as feedback to students. We supply more details on this in Section 4.
We discuss the implications of our observations, as well as ways for mitigating the
use of ChatGPT and other similar tools, in Section 5. In Section 6 we review major
trends in educational tools for assessment and feedback prior to ChatGPT, and relate
them to our own classroom experiences and practices. In Section 7 we review related
work and in Section 8 we conclude and provide future directions.
Figure 1. Survey results: answers to questions (a) When did you hear about generative AI? and (b) How
many times have you used generative AI for homework or other school activities?
Figure 2. Survey results: answers to questions (a) Has generative AI helped you gain a better understanding
of curricula? and (b) Do you believe generative AI has good accuracy?
AI has good or perfect accuracy, meaning that it generates code which is correct and
compiles. At the same time, a third of respondents believe such tools to have low accu-
racy and 13% believe they are not accurate at all (Figure 2 (a)). Finally, when asked
whether generative AI has been helpful in solving programming assignments, 42% of
respondents believe it has been helpful, while 40% believe it has provided little to no
assistance (Figure 2 (b)).
The survey clearly indicates that tools such as ChatGPT are utilized on a regular
basis by students for writing homework and preparing for exams. For this reason, we would
like to gain a better understanding of the accuracy and practicality of generative AI
tools, in particular ChatGPT, in providing correct but also useful results for students.
3. ChatGPT Evaluation
Figure 3. Breakdown of our dataset into: (a) hard (19.4%), medium (48.6%) and easy (31.9%) exercises;
(b) exercises with simple (68.1%) or complex (31.9%) statements; (c) small (64.5%), medium (23.7%) or
large-sized (11.8%) solutions.
Our dataset consists of the entire corpus of exercises spanning 7 labs from the lecture
“Functional Programming” (FP) taught in the Scala programming language (designed
by Odersky (2004)) during the second year of a Computer Science engineering degree
program at UPB. It contains 72 coding exercises, which we grouped into three cate-
gories: easy (23 exercises), medium (35 exercises) and hard (14 exercises), based on
our teaching experience from previous years.
We tagged each exercise as having a simple (49 exercises) or complex statement
(23 exercises). Simple statements are single-phrased and may be followed by a simple
function signature. Complex statements are multi-paragraph, may contain several code
snippets which are relevant for the solution, or alternatively may contain mathematical
equations described using ASCII symbols (e.g. x_1 + 1 = x_2). Many of our exercises
rely on templates, wherein the students are required to fill out a function signature or
a code structure provided to them. An example of such a template is the following:
def take(n: Int, l: List[Int]): List[Int] = {
def go(n: Int, result: List[Int], l: List[Int]): List[Int] =
(n, l) match {
case (0, _) => result.reverse
case (_, Nil) => result.reverse
case (n, h :: tail) => go(n - 1, h :: result, tail)
}
go(n, Nil, l)
}
The function take creates a new list containing only the first n elements of list l.
The natural solution to implement take is using simple recursion. The solution given
by ChatGPT relies on a helper function go, which is tail-recursive and has the property
of extracting elements in reverse order. For this reason, before returning the result, the
list must be reversed. This behavior is not yet understood at this stage of our lecture,
and students not familiar with tail-recursion may have a hard time understanding why
reversal is necessary. This type of illegible solution (although correct and effective)
will not help students in comprehending the task and may result in them incorrectly
assuming that reversal is a necessary component of the take implementation.
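For contrast, the following is a minimal sketch of the simply recursive version we would expect from students at this stage of the lecture (our own illustration, not taken from the dataset or from ChatGPT's output):

def take(n: Int, l: List[Int]): List[Int] = (n, l) match {
  case (0, _)         => Nil                    // we have taken enough elements
  case (_, Nil)       => Nil                    // the list was exhausted early
  case (n, h :: tail) => h :: take(n - 1, tail) // keep the head, recurse on the tail
}

This version needs neither an accumulator nor a final reversal, which is precisely what makes it easier to read for students who have not yet covered tail recursion.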
The second example is a solution for a task related to the main diagonal of a matrix.
The two nested for loops (over i and j) generated by ChatGPT are unnecessary and
inefficient. They should be replaced by a single loop ranging over the row/column index
of the matrix.
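Since the generated listing is not reproduced above, the following sketch merely illustrates the shape of the two solutions being contrasted; it is our own reconstruction, not ChatGPT's verbatim output:

// Inefficient shape: two nested iterations over i and j, keeping only the
// cells that lie on the main diagonal.
def mainDiagonalNested(m: Vector[Vector[Int]]): List[Int] =
  (for {
    i <- m.indices
    j <- m(i).indices
    if i == j
  } yield m(i)(j)).toList

// Preferred shape: a single iteration over the row/column index.
def mainDiagonal(m: Vector[Vector[Int]]): List[Int] =
  (for (i <- m.indices) yield m(i)(i)).toList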
Figure 4. Evaluation results: (a) ChatGPT exercise correctness rates (correct, correct after follow-up,
illegible, against the total number of exercises) and (b) correct test generation rates (test correct, test
correct after follow-up, against the total number of tests).
Figure 5. Correctness of ChatGPT solutions per statement complexity (simple vs. complex statements),
expressed as: (a) percentages and (b) absolute values from the dataset.
mathematical facts. One such example is finding the divisors of 7 out of a range of
integers. When asked whether 7 is a divisor of 21 (this fact was behind an incorrect
generated test in one exercise), ChatGPT was unable to reply correctly. The answers to
follow-up questions did not help in improving the answer.
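For reference, the underlying fact can be checked with a one-line Scala expression (our own illustration, not part of the dataset):

// 21 = 3 * 7, so 7 divides 21 and 21 appears among the multiples of 7 in 1..30.
assert(21 % 7 == 0)
assert((1 to 30).filter(_ % 7 == 0).contains(21))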
We were interested to see whether ChatGPT’s correct answers correlate with fea-
tures of our exercises. We discovered that, for exercises involving complex statements,
the initial correct answer rate was 52%, which increased to 78% after follow-up ques-
tions. On the other hand, for exercises with basic statements, the correct answer rate
was 75%, improving to 89% after follow-up questions (Figure 5). Another interesting
correlation to explore is that with exercise difficulty. ChatGPT's performance is almost
indistinguishable at the easy and medium levels; however, ChatGPT is noticeably less
able to answer hard questions correctly (Figure 6). These questions contain subtleties
in code which ChatGPT is unable to capture. One such example is related to an
overflow/underflow during division in the following program statement:
For very small values of a and x, the value of the left-hand-side of the expression
may become very large, thus incorrectly validating the condition and triggering a loop.
ChatGPT is unable to identify the problem with this code even when it is explicitly
prompted in a reply.

                                    Easy exercises   Medium exercises   Hard exercises
Correct in the first reply               69%                71%               57%
Correct after follow-up questions        91%                88%               71%

Figure 6. Correctness of ChatGPT solutions per exercise difficulty, expressed as: (a) percentages and
(b) absolute values from the dataset.
Finally, we looked at the correlation of correctness with the solution size. As be-
fore, we observe that there is no significant correlation here (Figure 7). ChatGPT is
equally capable of generating small and medium code solutions. Out of 5 exercises
with an expected large solution, only 2 received a correct solution in the first answer.
After follow-up questions, one more correct solution was given. Due to the small number
of such exercises, it is hard to draw any conclusions regarding ChatGPT's ability to
generate larger code. While performing this study, we also asked ChatGPT many questions
about qualitative aspects of the code it generated. We noticed that it can correctly
establish whether a given piece of code is written in functional style, and whether a
given function is recursive or tail-recursive. This motivated us to experiment with writing
a tool which our teaching team could use to semi-automatically generate code reviews over
student homework. This tool is described in detail in Section 4.
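To illustrate the kind of distinction such a qualitative question targets, consider the following pair of functions (our own example, not ChatGPT output):

import scala.annotation.tailrec

// Plain recursion: the multiplication happens after the recursive call returns,
// so each call keeps its stack frame.
def factorial(n: Int): BigInt =
  if (n <= 1) 1 else n * factorial(n - 1)

// Tail recursion: the recursive call is the last operation performed, so the
// compiler (checked here by @tailrec) can reuse the stack frame.
def factorialTR(n: Int): BigInt = {
  @tailrec
  def go(n: Int, acc: BigInt): BigInt =
    if (n <= 1) acc else go(n - 1, n * acc)
  go(n, 1)
}

A question such as "is the given function tail-recursive?" distinguishes the two.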
                                    Small-size solutions   Medium-size solutions
Correct in the first reply                   71%                    60%
Correct after follow-up questions            85%                    94%

Figure 7. Correctness of ChatGPT generated solutions per size (lines of code), expressed as: (a) percentages
and (b) absolute values from the dataset.
academic year.
Using our tool, we have queried select pieces of code from 67 homework assignments
and asked whether they have been written in functional style. The code pieces ranged
in size from 1–2 lines to a record of 55 lines of code, for the same implementation task.
Approximately 77% of the queries received complete and correct answers, containing valid
argumentation. Around 7% of the answers were overall correct but contained some
flawed parts in their argumentation. For instance, the claim generated by Chat-
GPT: “the use of return statements to exit early from the function is not idiomatic of
functional programming” is correct in principle but did not apply to the code piece
under scrutiny. Finally, around 15% of the answers were incorrect.
Our tool was written in Scala. It uses the ScalaMeta (2023) parser to isolate partic-
ular pieces of code (e.g. function implementations) from each homework submission. It
then assembles the code as well as the question under scrutiny into a ChatGPT query.
Finally, it uses the Scala OpenAI Client (2023), which interacts with the OpenAI API
using HTTP requests, to submit the query to ChatGPT and retrieve the answer.
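As a rough illustration of the extraction and prompt-assembly steps, the following is a minimal sketch (simplified names and structure of our own choosing, not the tool's actual source; the HTTP call through the openai-scala-client library is elided):

import scala.meta._

// Sketch: isolate the definition of a function with a given name from one
// submission, then wrap it into a ChatGPT query.
object ReviewQuery {
  def extractFunction(sourceCode: String, functionName: String): Option[String] =
    sourceCode.parse[Source] match {
      case Parsed.Success(tree) =>
        tree.collect {
          case d: Defn.Def if d.name.value == functionName => d.syntax
        }.headOption
      case _: Parsed.Error => None // the submission does not even parse
    }

  def buildPrompt(code: String, question: String): String =
    s"""Consider the following Scala function:
       |
       |$code
       |
       |$question""".stripMargin

  // The assembled prompt is then submitted through the OpenAI client and the
  // answer is stored for human review (call omitted here).
}

A reviewer would, for example, build one prompt per submission from the extracted function and a question such as "Is the given function tail-recursive?", and feed the resulting prompts to ChatGPT in sequence.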
5. Discussion
5.2. Mitigating ChatGPT in coding assignments
As previously shown, our assessment of ChatGPT’s utilization is based exclusively
on the responses gathered from students who participated in the survey outlined in
Section 2. At the same time, ensuring good academic ethics is our priority, which
requires identifying strategies for mitigating the potential misuse of ChatGPT.
We have not encouraged ChatGPT usage in our lab. In the absence of robust pla-
giarism tools, our strategy has been to require students to explain, as well as rewrite,
small, one-line pieces of code under our supervision, in order to receive points for a
homework assignment. For the latter part, we modify one line of code to produce a
compile error, and ask students to fix it.
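As a hypothetical illustration of such a modification (not one of the edits we actually use), applied to the take solution shown earlier:

// Original line in the submitted solution:
//   case (_, Nil) => result.reverse
// Line after our modification, which no longer compiles (an Int is returned
// where a List[Int] is expected):
//   case (_, Nil) => result.length

The student is asked to spot and repair the type mismatch, which requires understanding what the branch is supposed to return.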
This aspect raises interesting questions, also highlighted by Fincher (2019) in Chap-
ter 14, such as: (i) is the ability to understand, adapt and alter existing code,
whether computer-generated or not, a satisfactory skill for graduating a programming
lecture? (ii) from an educational standpoint, is it a beneficial and sustainable practice
to learn coding by refining and amending initial code examples, as a complement to the
more traditional approach of writing code entirely from scratch? Although these ques-
tions receive diverse responses, each supported by equally compelling arguments, we
abstain from adopting a stance on them, as addressing them lies beyond the scope of
this paper.
What we have observed is that students who successfully solved our one-line mod-
ification task tend to perform well in our exam and achieve great results in their lab
work. It’s worth mentioning that during lab sessions, students work on exercises un-
der the guidance of a tutor, hence ChatGPT usage is not likely there. To conclude,
our evaluation practice on code ownership aligns quite well with the overall academic
student performance in our lecture. Students who have tackled assignments through
unethical means typically opt out of presenting their work, resulting in their homework
not being graded.
a tool requires a substantial reevaluation of our programming curricula and teaching
style. When students rely even more on AI-generated code snippets, new program-
ming skills need to be emphasized. We enumerate a few: breaking a big programming
task or algorithm into different parts, better and more in-depth test-writing skills as
well as improved code-reading and understanding skills. We believe that this direction
is promising and deserves a more comprehensive exploration in future research. This
is supported by the usage of Copilot in the programming industry, which has gained
momentum. We discuss this in more detail in Section 7.2.
Programming has continuously progressed hand in hand with the advancement of its
accompanying development tools. From its inception, essential tools like linkers and
compilers have played a pivotal role. For example, Grace Hopper’s COBOL language
evolved as a novel tool allowing programmers to express programs using human-
readable instructions rather than laboring with low-level assembly code (Beyer (2009)).
In modern times, programming heavily leans on advanced tools such as Integrated
Development Environments (IDEs) which are indispensable for orchestrating complex
build systems necessary for large-scale applications. Being aware of, accustomed to and
ultimately proficient in using such tools is an increasingly important task for the student
programmer.
Nowadays, the landscape of educational tools relevant for teaching programming
extends much further and includes visualization tools suitable for illustrating
both algorithms and code execution, such as that presented by Sirkia (2009), learning
environments such as Massive Open Online Course platforms (MOOCs), and program
development tools which are relevant to both students and senior programmers alike (e.g.
DrScheme - Findler (1997)).
Reviewing all these approaches is outside the scope of this paper. Instead, we will
focus our attention on two types of tools that are pertinent to our work: (i) tools em-
ployed by educators to assess aspects such as correctness and coding style in homework
submissions and (ii) tools designed to assist students in their homework development.
These tools can range from straightforward tests made publicly available to students,
to more sophisticated testing and grading environments. Sometimes, such tools can
have significant overlap with (i).
a manner similar to AUTOMARK. Also, the much more recent PASS is designed
for C programs, which are notably more intricate compared to the Fortran programs
targeted by AUTOMARK.
There is a clear shift from simplistic test-based homework assessment towards a
greater emphasis on code quality, encompassing both style and efficiency. This tran-
sition is evident in tools such as that of Saikkonen (2001), and modern tools such
as ScalaStyle (2019). The work of Saikkonen (2001) assesses coding style in homework
submissions written in the functional language Scheme, while Scalastyle achieves a
similar objective in the functional language Scala.
used in specific parts of the homework; (c) whether some parts of the code can be
rewritten using functional composition; (d) if the implementation was suitably broken
into smaller parts. From all of the above, only (d) could be properly addressed using
existing methods (a variant of cyclomatic complexity could be used for this task).
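As an illustration of item (c), the kind of rewrite we have in mind is sketched below (our own example):

// Two equivalent ways to post-process a list of grades.
val normalize: Double => Double = _ / 10.0
val round: Double => Int = d => math.round(d).toInt

// Without composition: two separate traversals of the list.
def process1(grades: List[Double]): List[Int] = grades.map(normalize).map(round)

// With functional composition: a single traversal applying the composed function.
def process2(grades: List[Double]): List[Int] = grades.map(normalize andThen round)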
Furthermore, we would like to make this tool publicly accessible to students, not
just evaluators, serving as an assistant to direct students towards a code structure
that is efficient, follows functional programming principles and maintains readability.
We anticipate that in the near future, Large Language Models (LLMs) will pave the
way for such tools in computing education, in several directions. One of these direc-
tions involves generating automated feedback, particularly regarding coding quality.
Additionally, tools like Copilot that seamlessly integrate with popular IDEs, aiding
students in programming tasks, will significantly influence how both programming
and teaching are conducted.
7. Related work
There is a large public debate covering the future of programming after ChatGPT,
such as Glen (2023). The author argues that, while ChatGPT or Copilot may take over
certain repetitive coding tasks, they will not eliminate the need for programmers. Com-
plex programming endeavors such as devising design patterns for programs, creating
new algorithms, tackling program computational complexity remain firmly outside the
expertise of any generative tool.
The work of Borji (2023) examines the common errors made by ChatGPT when pre-
sented with various queries, including those related to programming. Although this
study does not primarily deal with evaluating ChatGPT's code generation capabili-
ties, it does shed light on pertinent issues, such as ChatGPT's limitations in tackling
intricate programming tasks, its inability to deduce straightforward algorithmic solu-
tions for convoluted programming statements and its inability to deduce mathematical
identities. These findings are very much in line with our own observations, as shown in
Section 3.2.
Sobania (2023) studies ChatGPT’s bug-fixing capabilities with promising results.
Their work uses ChatGPT in order to fix bugs in Python programs, as part of the
Automated program repair (APR) line of research. While their setting is different from
ours, a bug-fixing task is very much similar, conceptually, to a programming exercise.
Sobania (2023) reports 31 bugs fixed by ChatGPT out of a total of 40, which is a
77.5% repair rate, quite similar to our reported accuracy.
of Python code generation from documentation, in a software development setting.
Chen (2021) focuses on fine-tuning existing models to achieve better performance. It
reports similar performance and highlights similar risks, such as over-reliance on AI-
generated solutions: some code snippets generated by Codex may appear to be
correct but are, in fact, flawed.
The study of Denny (2023) shifted focus from productivity to evaluating Copilot’s
performance on a publicly accessible dataset comprising 166 programming problems.
They found that Copilot successfully solves around half of these problems on its very
first attempt, and further solves 60% of the remaining problems solely through nat-
ural language adjustments to the problem description. The success rate of Copilot as
reported by Denny (2023), even when evaluated on imperative programming languages
such as Java, Python or C++, closely resembles the results that we have observed.
Their work aligns with our observations from Section 5.3, emphasizing the significance
of prompt engineering, i.e. the iterative approach of generating and refining code so-
lutions from an initial template, until arriving at a correct solution. This emerging
programming methodology holds promise as a novel approach to code learning as well
as development. However, it requires a more comprehensive evaluation. As both Denny
(2023) and our own observations suggest, its educational impact deserves more at-
tention, with noteworthy concerns of over-reliance on AI-generated code by beginner
programmers.
The paper of Kazemitabaar (2023) performs a study aimed at assessing how students
improve when using Copilot. A total of 69 beginner programmers were tasked with
completing 45 coding assignments in Python. Half of the participants used Copilot
while the other half did not. The study showed an enhancement in the completion rate,
with Copilot users achieving a 1.15x improvement, and an even more substantial 1.8x
improvement in their evaluation scores compared to those who did not use Copilot.
The work conducted by Finnie (2023) closely aligns with our research goal as it fo-
cuses on evaluating Codex’s performance in responding to code-writing questions from
CS1 and CS2 examinations. The experimental findings presented by Finnie (2023) re-
markably parallel our own in the following respects: (i) Codex's scores exhibit a consistent
decline as the length of generated code increases. In simpler terms, when the solution
becomes larger it is more likely to be incorrect. (ii) The Codex performance reported
by Finnie (2023) is around 78% on the CS1 exam and 58% on the CS2 exam, with an
average of 68% across exams, which is exactly the same overall success rate we have
observed for ChatGPT. The result is remarkably similar given the differences between
experimental settings: the CS1 and CS2 exams are given in Python using impera-
tive and object-oriented coding style and concepts, while our exercises are written in
Scala, emphasizing functional programming principles. Additionally, our exams take
the form of lab exercises, and a portion of them are larger in both scale and complexity,
often requiring more time to solve. The similarity of these results to our own suggests
that both Codex and GPT-3 models exhibit a robust level of accuracy across beginner
and intermediate programming lectures, regardless of the programming paradigm.
The work of Wermelinger (2023) also reports on the usage of Copilot as a possible
tool to assist students in solving lab exercises. It confirms previous observations that
the first solution generated by Copilot is more likely to be correct. While there is no
dataset for accuracy evaluation of Copilot in his work, Wermelinger highlights the
abundance of inaccurate Copilot solutions.
7.3. Generating explanations
The work of MacNeil (2023) provides interesting insights similar to our observations
from Section 4. Instead of generating code reviews, the work of MacNeil (2023) is
focused on generating explanations for code snippets in programming materials. Three
different types of explanations were generated in this study: line-by-line explanations,
an enumeration of concepts captured by the code-piece under scrutiny and a general
overview of the code’s functionality. MacNeil (2023) concludes that explanations were
beneficial for students, however, as far as we could observe, there is no reporting of
the accuracy or correctness of the LLM-generated explanations.
8. Conclusion
The evolving landscape of research highlights the great potential of LLMs in the realm
of programming. It also motivates further exploration to gain more insight into the
impact of ChatGPT or Copilot on programming education.
To date, all existing educational research has primarily concentrated on simple,
imperative-style programming, predominantly in languages such as Python, occasion-
ally in Java or C/C++. To our knowledge, no investigation except the present one
looked at Functional Programming Languages such as Scala, and how these tools per-
form in such contexts. While our work is an initial step, more are needed to gain a
comprehensive view of LLMs' potential and shortcomings.
Existing plagiarism tools, whether traditional or novel, face limitations in recognis-
ing AI-generated code. More importantly in our view, there is a scarcity of studies
exploring how students actually use ChatGPT or Copilot. Also, there exists great
potential in using such tools for generating valuable educational content.
We plan on pursuing the latter two directions. More concretely, we plan on organ-
ising a new course centered around functional programming that integrates Copilot as
a programming tool. We also plan on developing an automated process for generating
coding style reviews for students, a valuable resource for our future lectures. With
these two directions, we aim to lay the groundwork for novel approaches in programming
education, ultimately helping our students become more skilled and productive in the
ever-changing landscape of program development.
References
Beyer, K. (2009). Grace Hopper and the Invention of the Information Age. MIT Press. ISBN
978-0262013109.
Borji, A. (2023). A Categorical Archive of ChatGPT Failures. Retrieved Sept. 2023.
https://arxiv.org/abs/2302.03494
ChatGPTDataset. (2023). Dataset used in our evaluation. Available at:
https://github.com/pdmatei/ChatGPTinEducation
Chen, M., Tworek, J., Jun, H., Yuan, Q., et al. (2021). Evaluating Large Language Models Trained
on Code. https://arxiv.org/abs/2107.03374.
Denny, P., Kumar, V., Giacaman, N. (2023). Conversing with Copilot: Exploring Prompt Engineering
for Solving CS1 Problems Using Natural Language. In SIGCSE 2023: Proceedings of the
54th ACM Technical Symposium on Computer Science Education. March 2023.
Edwards, S. H. (2003). Rethinking computer science education from a test-first perspective.
In Companion of the 18th Annual ACM SIGPLAN Conference on Object-Oriented Pro-
gramming, Systems, Languages, and Applications (pp. 148–155). New York: ACM.
Elsen-Rooney, M. (2023). NYC education department blocks ChatGPT on school devices, net-
works. Retrieved Jan. 2023. https://ny.chalkbeat.org/2023/1/3/23537987/nyc-schools-ban-
chatgpt-writing-artificial-intelligence
Fincher, S., Robins, A. (2019) The Cambridge handbook of computing education research. Cam-
bridge University Press.
Findler, R. B., Flanagan, C., Flatt, M., Krishnamurthi, S., Felleisen, M. (1997) DrScheme: A
pedagogic programming environment for Scheme. In International Symposium on Program-
ming Language Implementation and Logic Programming (pp. 369–388). Berlin, Germany:
Springer.
Finnie-Ansley J., Denny P., et al. (2023) My AI Wants to Know if This Will Be on the Exam:
Testing OpenAI’s Codex on CS2 Programming Exercises. In ACE ’23: Proceedings of the
25th Australasian Computing Education Conference. January 2023.
Frieder, S., Pinchetti, L., Griffiths, R., Salvatori, T., Lukasiewicz, T., Petersen,
P.C., Chevalier, A., Berner, J. (2023). Mathematical Capabilities of ChatGPT.
https://arxiv.org/abs/2301.13867
Glen, S. (2023). ChatGPT writes code, but won't replace developers. Retrieved Jan. 2023.
https://www.techtarget.com/searchsoftwarequality/news/252528379/ChatGPT-writes-
code-but-wont-replace-developers
GPTZero. (2023). Generative AI text detector. Retrieved Jan. 2023. https://gptzero.me/team
Hollingsworth, J. (1960). Automatic graders for programming classes. Communications of the
ACM, 3(10), 528–529.
Jackson, D., Usher, M. (1997). Grading student programs using ASSYST. ACM SIGCSE Bul-
letin, 29(1), 335–339.
Jalil, S., Rafi, S., LaToza, T. D., Moran, K., Lam, W. (2023). ChatGPT and Software Testing
Education: Promises and Perils. https://arxiv.org/abs/2302.03287
Kazemitabaar, M., Chow, J., Ka To Ma, C., Ericson, B., Weintrop, D., Grossman., T. (2023)
Studying the effect of AI Code Generators on Supporting Novice Learners in Introductory
Programming. In Proceedings of the 2023 CHI Conference on Human Factors in Computing
Systems (CHI ’23), April 23–28, 2023, Hamburg, Germany
Lambeets, K. (2023). How should Radboud University handle ChatGPT? Retrieved Jan. 2023.
https://www.voxweb.nl/english/how-should-radboud-university-handle-chatgpt
Lau, S., Guo, P., (2023) From ”Ban It Till We Understand It” to ”Resistance is Futile”:
How University Programming Instructors Plan to Adapt as More Students Use AI Code
Generation and Explanation Tools such as ChatGPT and GitHub Copilot. In Proceedings
of the 2023 ACM Conference on International Computing Education Research V.1 (ICER
’23 V1), August 7–11, 2023, Chicago, IL, USA.
MacNeil, S., Tran, A., Hellas, A., Kim, J., Sarsa, S., Denny, P., Bernstein, S., Leinonen, J.
(2023) Experiences from Using Code Explanations Generated by Large Language Models
in a Web Software Development E-Book. In SIGCSE 2023: Proceedings of the 54th ACM
Technical Symposium on Computer Science Education. March 2023.
Mitrović, S., Andreoletti, D., Ayoub, O. (2023). ChatGPT or Human? Detect and Explain.
Explaining Decisions of Machine Learning Model for Detecting Short ChatGPT-generated
Text. https://arxiv.org/abs/2301.13852
Odersky, M., et al. (2004). An Overview of the Scala Programming Language (IC/2004/64).
Technical report, EPFL, Lausanne, Switzerland.
OpenAI-ChatGPT (2023). ChatGPT. Retrieved Jan. 2023. https://openai.com/blog/chatgpt
OpenAI-Client (2023). Scala Client. 2023 version. https://github.com/cequence-io/openai-
scala-client
Peng, S., Kalliamvakou, E., Cihon, P., Demirer, M. (2023). The Impact of AI on Developer
Productivity: Evidence from GitHub Copilot. https://arxiv.org/abs/2302.06590
Popovici, M. (2022). Functional Programming lecture, taught in 2022.
https://ocw.cs.pub.ro/ppcarte/doku.php?id=fp2022
Redish, K. A., Smyth, W. F. (1986). Program style analysis: A natural by-product of program
compilation. Communications of the ACM, 29(2), 126–133.
Saikkonen, R., Malmi, L., Korhonen, A. (2001). Fully automatic assessment of programming
exercises. ACM SIGCSE Bulletin, 33(3), 133–136.
ScalaMeta (2023). Parser API for Scala. Version 2023.
https://github.com/scalameta/scalameta.
ScalaStyle (2019). https://github.com/scalastyle/scalastyle/wiki. Retrieved Sept. 2023.
Sirkiä, M., (2009) Jsvee & Kelmu: Creating and tailoring program animations for computing
education. In 2016 IEEE Working Conference on Software Visualisation (VISSOFT) (pp. 36
- 35), New York.
Sobania, D., Briesch, M., Hanna, C., Petke, J. (2023). An Analysis of the Automatic Bug
Fixing Performance of ChatGPT. 2023. https://arxiv.org/abs/2301.08653
Shani, I. (2023). Survey reveals AI's impact on the developer experience. Retrieved Aug. 2023.
https://github.blog/2023-06-13-survey-reveals-ais-impact-on-the-developer-experience/
Stokel-Walker, C. (2023). ChatGPT listed as author on research papers: many scientists dis-
approve. In Nature Articles. Retrieved Jan. 2023. https://www.nature.com/articles/d41586-
023-00107-z
Thorburn, G., Rowe, G. (1997). PASS: An automated system for program assessment. Com-
puters & Education, 29(4), 195–206.
Vaithilingam, P., Zhang, T., Glassman, E. (2022). Expectation vs. Experience: Evaluating the
Usability of Code Generation Tools Powered by Large Language Models. In ACM CHI EA
’22. 2022.
Zaremba, W., Brockman, G., OpenAI. (2021). OpenAI Codex. Retrieved Aug. 2023.
https://openai.com/blog/openai-codex/
Vahid, F., Areizaga, L., Pang, A. (2023) ChatGPT and Cheat Detection in CS1 Using a
Program Autograding System. Retrieved Aug. 2023. https://www.zybooks.com/chatgpt-and-
cheat-detection-in-cs1-using-a-program-autograding-system/
Wermelinger, M., (2023) Using GitHub Copilot to Solve Simple Programming Problems. In
SIGCSE 2023: Proceedings of the 54th ACM Technical Symposium on Computer Science
Education. March 2023.