Anh Nguyen-Duc • Pekka Abrahamsson • Foutse Khomh
Editors

Generative AI for Effective Software Development
Editors

Anh Nguyen-Duc
Department of Business and IT
University of South-Eastern Norway
Bø i Telemark, Norway

Pekka Abrahamsson
Tampere University
Tampere, Finland

Foutse Khomh
Polytechnique Montréal
Montréal, QC, Canada
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The advanced machine learning that powers generative AI-enabled products has
been decades in the making. But since ChatGPT came off the starting block in late
2022, new iterations of generative AI technology have been released several times
a month. In March 2023 alone, there were six major steps forward, introducing
significant improvements for software engineering-related innovation.
The future of software engineering holds exciting promises with the integration
of generative AI. As technology continues to advance, AI-driven systems are poised
to revolutionize the way software is designed, developed, and maintained. One of
the most compelling visions is the concept of collaboration between AI agents
and software engineers. In this vision, generative AI will work alongside human
developers, assisting them in various aspects of the software development lifecycle.
These AI collaborators will have the capability to understand natural language
requirements, generate code based on high-level descriptions, and even help in
debugging and optimizing code. This collaborative partnership between human
developers and AI is expected to significantly accelerate the software development
process, reduce errors, and open up new possibilities for innovation.
Another key aspect of the future of software engineering is the potential for
AI to automate routine and time-consuming tasks. Generative AI models can
generate code snippets, templates, and even entire modules based on patterns
and best practices learned from vast repositories of code. This automation will
allow developers to focus on more creative and complex aspects of software development, such as requirements elicitation, crafting user-centric interfaces, making high-level architectural decisions, and addressing ethical considerations and compliance. In 2022,
GitHub reported on the impact of Copilot on developer productivity and happiness.
Eighty-eight percent of survey respondents replied that they feel more productive
with Copilot. Moreover, 74% say they are faster with repetitive tasks and can
focus on more satisfying work. We envision that the positive impact of tools like
Copilot will expand, benefiting various types of projects and adapting to diverse
organizational environments.
Furthermore, generative AI in software engineering will enable the creation of
highly personalized and adaptive software systems. These AI-driven applications
will be capable of learning from user interactions and preferences, continuously
evolving to meet the changing needs of their users. For instance, in the realm of
user interface design, AI can generate user interfaces that are tailored to individual
preferences and accessibility requirements. This level of personalization will lead
to more engaging and user-friendly software experiences, ultimately enhancing user
satisfaction and the overall quality of software products. In essence, the future of
software engineering with generative AI holds the promise of increased productivity,
improved software quality, and the creation of highly customized and adaptive
software solutions.
Another significant impact of generative AI in software engineering will be
in the realm of personalized software solutions. AI will enable the creation of
highly customized software that can adapt to the specific needs and preferences
of individual users or businesses. This will be made possible by AI’s ability to
learn from user interactions and evolve over time. The software will become more
intuitive and user-friendly, as it will be able to anticipate user needs and offer
solutions proactively. This level of personalization will not only enhance user
experience but also open up new avenues for innovation in software design and
functionality.
To provide a complete perspective, it’s essential to acknowledge the existing
challenges of generative AI technologies, which currently stand as key areas of
focus for research and development among software engineering scholars and
professionals. In general, large language models (LLMs) are still prone to hallucination, producing misleading, inconsistent, or unverifiable information. Models built
on top of historically biased data pose problems of fairness and trustworthiness,
and when incidents happen, issues about safety and responsibility can arise. LLMs
may fall short of mastering generation tasks that require domain-specific knowledge
or generating structured data. It is nontrivial to inject specialized knowledge into
LLMs. Techniques like prompting, augmenting, fine-tuning, and the use of smaller
AI models present potential solutions, yet applying them to specific problems is
nontrivial. For software engineering tasks, the current evaluation of generative
AI focuses more on code generation tasks, with less emphasis on evaluating or
researching other tasks such as requirement generation, code fixes, and vulnerability
repair. It is anticipated that exploring these areas will be a significant and influential
line of research for the software engineering community.
Finally, generative AI will also play a crucial role in democratizing software
development. With AI-assisted coding, individuals who may not have formal
training in programming will be able to create and modify software. This will
break down barriers to entry in the field of software development and foster a more
inclusive and diverse community of software creators. It will empower a wider
range of people to bring their unique ideas to life, leading to a more vibrant and
innovative software landscape. This democratization will not only spur creativity
but also lead to the development of solutions that cater to a broader spectrum of
needs and challenges in various industries.
This book examines the application of generative AI tools in real-world scenarios, offering insights into their practical efficacy, limitations, and
impact on various aspects of software engineering. It focuses on the human aspect,
examining how generative AI influences the roles, collaborations, and decision-
making processes of developers from different countries and cultures. By presenting
case studies, surveys, and interviews from various software development contexts,
the book aims to offer a global perspective on the integration of generative AI,
highlighting how these advanced tools are adapted to and influence diverse cultural,
organizational, and technological environments. This multifaceted approach not
only showcases the technological advancements in generative AI but also deeply
considers the human element, ensuring that the narrative remains grounded in
the practical realities of software developers worldwide. While generative AI
technologies encompass a wide range of data types, our cases focus mainly on LLMs
with text and code generation. The evaluation is done with current models, such as
Llama 2 or ChatGPT-4, acknowledging the current limitations associated with them.
application in code generation tasks (Chap. 7). Dang Nguyen Ngoc Hai et al. explore
the CodeBERT approach for automatic program repair of security vulnerabilities
(Chap. 8). Väinö Liukko and colleagues present a case study of ChatGPT as a full-
stack Web developer (Chap. 9).
Part IV examines how generative AI is reshaping software engineering pro-
cesses, from collaboration and workflow to management and agile development.
To start with, Rasmus Ulfsnes and co-authors provide empirical insights on how
generative AI is transforming collaboration and workflow in software development
(Chap. 10). Beatriz Cabrero-Daniel, Yasamin Fazelidehkordi, and Ali Nouri discuss
the enhancement of software management with generative AI (Chap. 11). Dron
Khanna and Anh Nguyen Duc conduct a survey study on the value-based adop-
tion of ChatGPT in Agile software development among Nordic software experts
(Chap. 12). Guilherme Pereira and colleagues share early results from a study of
generative AI adoption in a large Brazilian company, focusing on the case of Globo
(Chap. 13).
Part V is about future directions and education. The final section of the book
looks toward the future, exploring emerging trends, future directions, and the
role of education in the context of generative AI. Shunichiro Tomura and Hoa
Dam discuss generating explanations for AI-powered delay prediction in software
projects (Chap. 14). Mohammad Idris Attal and team present a prompt book for
turning a large language model into a start-up digital assistant (Chap. 15). Mika
Saari and colleagues explore effective approaches to utilize AI tools in programming
courses, guiding students in this new era of AI-assisted software development
(Chap. 16).
This book is not a technical manual on how to code with generative AI tools.
The book is also not about customizing or developing generative AI models
but rather their application in software engineering. The book offers a strategic,
managerial, and process-centric viewpoint, highlighting how generative AI can potentially be leveraged in different software development activities, irrespective of the programming language, software technologies, or development framework.
While this book presents various empirical applications of generative AI in
software development, it is not an exhaustive guide on all aspects of software
engineering. It is, however, a crucial read for anyone interested in understanding
how generative AI is revolutionizing software development and what it means for
the future of this field.
The book offers diverse perspectives as it compiles research and experiences
from various countries and software development environments, reflecting a global
view of generative AI’s impact. The book offers non-technical discussions about
generative AI in management, teamwork, business, and education.
Acknowledgments
This book would not have been possible without the massive collaborative effort
of our reviewers, authors, and editors. The insights encapsulated within these
pages are a product of the knowledge and experiences shared by many software
engineering researchers and practitioners. Although the authors and editors are
specifically acknowledged in each chapter or callout, we’d like to take time to
recognize those who contributed to each chapter by providing thoughtful input,
discussion, and review. We extend our gratitude to Khlood Ahmad, Christian
Berger, Beatriz Cabrero-Daniel, Ruzanna Chitchyan, John Grundy, Eduardo Guerra,
Helena Holstrom Olsson, Zoe Hoy, Ronald Jabangwe, Marius Rohde Johannessen,
Dron Khanna, Foutse Khomh, Dae-Kyoo Kim, Johan Linåker, Jorge Melegati,
Anh Nguyen Duc, Amin Nikanjam, Dimitris Polychronopoulos, Tho Quan, Usman
Rafiq, Viktoria Stray, Ingrid Sunbø, Rasmus Ulfsnes, Hironori Washizaki, and Jules
White.
This book represents a collaborative effort that extends beyond the boundaries
of any single institution or discipline. We are profoundly grateful to the numerous
contributors whose expertise, insights, and unwavering dedication have been instru-
mental in bringing this project to fruition:
• Norwegian University of Science and Technology, Norway
• University of Oslo, Norway
• University of South-Eastern Norway, Norway
• SINTEF, Norway
• Chalmers University of Technology, Sweden
• University of Gothenburg, Sweden
• Volvo Cars, Sweden
• Solita Ltd., Finland
• Tampere University, Finland
• Free University of Bozen-Bolzano, Italy
• University of California Irvine, USA
• Vanderbilt University, USA
An Overview on Large Language Models
1 Introduction
These models are artificial neural networks based on the transformer architecture [92], such as Bidirectional Encoder Representations from Transformers (BERT) [24], and are (pre-)trained using self-supervised and semi-supervised learning. The transformer architecture, characterized by self-attention
mechanisms [92], serves as the fundamental building block for language modeling
tasks.
This approach has demonstrated effectiveness across broad applications, ranging
from language translation to code generation. Notable examples of these models
include OpenAI’s GPT [9] series (including GPT-3.5 and GPT-4, utilized in
ChatGPT), Google’s PaLM [13] (deployed in Bard), and Meta’s LLaMA [87].
2.1 Tokenization
Tokenization plays a crucial role in NLP by dividing documents, whether they are
text or code, into smaller units known as tokens. A token could stand for a word, sub-
word, character, or symbol, depending on the model’s type and size [4]. This process helps models effectively manage diverse languages and input formats [63].
There are several tokenization approaches, like WordPiece [81], Byte Pair Encoding
(BPE) [82], SentencePiece [49], and Unigram [48].
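As a brief illustration of subword tokenization, the sketch below splits a short code snippet with a BPE-based tokenizer from the Hugging Face Transformers library; the GPT-2 tokenizer is only an example choice and is not tied to any particular model discussed in this chapter.

# Minimal tokenization sketch, assuming the Hugging Face `transformers`
# package is installed; the GPT-2 (BPE) tokenizer is an illustrative choice.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "def add(a, b): return a + b"

tokens = tokenizer.tokenize(text)   # subword tokens of the input string
ids = tokenizer.encode(text)        # integer IDs that the model actually consumes

print(tokens)
print(ids)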
The Transformer architecture, which forms the core of LLMs, consists of an encoder
and a decoder [92]. This architecture is also known as sequence-to-sequence,
allowing it to transform an input sequence of tokens into an output sequence, as
seen in translation tasks [70, 115].
Each of these components is comprised of multiple layers of embedding,
attention, and feed-forward neural networks [37]. The encoder’s primary role is
to convert each token in the input sequence into a fixed-length vector, capturing
diverse impacts on the learning process, and their appropriateness may vary based
on the architecture and characteristics of the processed data [30].
2.5 Prompt
A prompt is an instruction given to a trained LLM at the inference step that enables
the model to generate answers for queries it has never seen before [119]. The output
generated by the LLM can adapt to the context and instructions provided in the prompt
without the need for fine-tuning or alignment. Trained LLMs can be prompted using
different setups to generate the best answers [99]. Widely used prompt setups will
be explored in Sect. 4.
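As a minimal, hedged sketch of prompting a trained model at inference time, the example below sends a plain-text instruction to a small text-generation pipeline; the model choice and the prompt wording are illustrative assumptions rather than part of the works cited above.

# Hedged sketch of zero-shot prompting at inference time, assuming the
# Hugging Face `transformers` package; `gpt2` is only a small example model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Explain in one sentence what a unit test is:"

# The prompt alone steers the output; no model parameters are updated.
output = generator(prompt, max_new_tokens=40)[0]["generated_text"]
print(output)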
3 Model Adaptation
This section explores various methods for adapting an LLM to a specific down-
stream task, spanning from pre-training to resource-efficient model adaptation.
3.1 Pre-training
Pre-training in LLMs denotes the initial training phase, encompassing both self-
supervised learning, where the model predicts masked words or sequences in
unlabeled data, and semi-supervised learning, integrating labeled data to fine-
tune the model for specific tasks. The term “pre-training” is employed because it
anticipates the need for additional training or post-processing steps to adapt the
pre-trained model to the desired task [64]. A widely used pre-training objective for
LLMs is Masked Language Modeling (MLM) [16]. In this pre-training technique,
the goal is to train the model by predicting tokens that are randomly masked within
the input sequence.
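To make the MLM objective concrete, the sketch below queries a masked language model at inference time through the fill-mask pipeline; the BERT checkpoint and the example sentence are illustrative assumptions.

# Minimal Masked Language Modeling sketch, assuming the Hugging Face
# `transformers` package; `bert-base-uncased` is an illustrative checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT's mask token is [MASK]; the model predicts likely fillers with scores.
for prediction in fill_mask("The unit tests [MASK] after the bug fix."):
    print(prediction["token_str"], round(prediction["score"], 3))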
3.2 Fine-Tuning
Fine-tuning involves taking pre-trained models and refining them through additional
training on smaller, task-specific labeled datasets [64]. This process adapts the
models’ capabilities and enhances their performance for a particular task or domain.
Essentially, fine-tuning transforms general-purpose models into specialized ones.
An example of such a task is fine-tuning CodeBERT for the defect detection task [69].
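A hedged sketch of this kind of fine-tuning is shown below; the CodeBERT checkpoint name is a real Hugging Face identifier, while the two-example dataset, its labels, and the hyperparameters are toy placeholders invented purely for illustration.

# Hedged fine-tuning sketch, assuming the `transformers` and `datasets`
# packages; the two-example corpus below is a toy stand-in for a real
# defect-detection dataset.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "microsoft/codebert-base"          # illustrative pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

examples = {"code": ["return a / b", "return a / b if b else 0"], "label": [1, 0]}
dataset = Dataset.from_dict(examples).map(
    lambda ex: tokenizer(ex["code"], truncation=True, padding="max_length", max_length=64)
)

args = TrainingArguments(output_dir="codebert-defect", num_train_epochs=1,
                         per_device_train_batch_size=2, logging_steps=1)
Trainer(model=model, args=args, train_dataset=dataset).train()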
LLMs have the potential to generate outputs that are incorrect, harmful, or biased.
Adapting LLMs with human feedback can aid in updating the model parameters to
mitigate the occurrence of such outputs [68]. Reinforcement Learning using Human
Feedback (RLHF) is a well-known technique employed for alignment tuning in
LLMs. In RLHF, a fine-tuned model is further trained with human feedback as
a part of the reward system [120]. The RLHF process involves collecting human
feedback on the outputs of a fine-tuned model. These feedback responses are then
used to learn a reward model that can predict a numerical reward for a generated
output. Finally, the model is optimized by incorporating this reward model and
leveraging RL techniques [120]. This iterative approach of learning from human
feedback contributes to enhancing the model’s alignment and adapting the model
to avoid generating incorrect, harmful, or biased outputs. Human feedback serves
as a valuable source for refining the model’s parameters, making it more adept at
addressing complex human preferences that may be challenging to capture through
traditional reward functions [62].
1 https://github.com/tloen/alpaca-lora.
LLMs demonstrate an ability for In-Context Learning (ICL), meaning that they can
learn effectively from a few examples within a specific context. Studies [3, 98, 111]
show that LLMs can perform complex tasks through ICL. The fundamental concept
of ICL revolves around the model’s capacity to learn the patterns through the
examples and subsequently make accurate predictions [27]. One advantage of
ICL is the possibility of engaging in a dialogue with the model. Another is that ICL
closely aligns with the decision-making processes observed in humans by learning
from analogy [100]. In contrast to traditional training and tuning approaches, ICL
operates as a training-free framework, significantly reducing the computational
costs associated with adapting the model to new tasks. Moreover, this approach
transforms LLMs into a black-box service that can be integrated into real-
world tasks [85]. Various ICL techniques have been proposed in the literature. In
the following section, we will discuss several well-known techniques.
Few-shot learning uses a few labeled examples in the prompt to adapt the model for a
specific task. This process involves providing contextual demonstration examples as
input/output pairs that represent the downstream task. These demonstrations serve
to instruct the model on how to reason or use tools and perform actions [3]. This
technique enables the use of the same model for various downstream tasks without
requiring tuning or changing the model’s parameters [9]. The effectiveness of this
technique relies on the relevancy of the few examples to the target task, and the
format of these examples guides the model in predicting the output format. For
instance, authors in [55] employ few-shot learning to demonstrate their method for
generating step-by-step solutions that align with the math problems in their training
data. The objective of this study is not to impart new skills to the model with few-
shot learning; instead, it aims to guide the model in generating solutions in a desired
step-by-step format.
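The sketch below illustrates the shape of such a few-shot prompt for a requirements classification task; the task, the demonstrations, and the labels are invented purely for illustration and are not taken from the studies cited above.

# Hedged few-shot prompt sketch: two labeled demonstrations followed by an
# unlabeled query; the prompt is sent as-is to a trained LLM, and no model
# parameters are changed.
few_shot_prompt = """Classify each requirement as Functional or Non-functional.

Requirement: The user can export reports as PDF.
Label: Functional

Requirement: The system shall respond to search queries within 2 seconds.
Label: Non-functional

Requirement: The administrator can deactivate user accounts.
Label:"""

print(few_shot_prompt)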
CoT is motivated by the natural step-by-step thinking ability of humans and has been
observed to improve the performance of LLMs in solving problems that require
multi-step reasoning [98]. The human thought process for tackling a complex
problem, such as a complex math problem, involves breaking down the problem into
intermediate tasks and solving them to reach the final answer. In CoT, the primary
task is decomposed into intermediate tasks, and the LLM then finds answers for
these intermediate tasks to resolve the main problem [98]. Another type of CoT is
the self-planning approach, which employs LLMs to break down the original task
into smaller steps termed plans [46]. The model is then invoked on these provided
steps for execution.
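A hedged sketch of a chain-of-thought style prompt is given below; the arithmetic example and its phrasing are invented for illustration rather than drawn from the cited works.

# Hedged chain-of-thought prompt sketch: a worked example with explicit
# intermediate steps, followed by a new question the model should answer in
# the same step-by-step style.
cot_prompt = """Q: A test suite has 120 tests and 15% of them fail. How many tests pass?
A: Let's think step by step.
15% of 120 is 0.15 * 120 = 18 failing tests.
120 - 18 = 102 tests pass.
The answer is 102.

Q: A repository has 240 open issues and the team closes 25% of them. How many remain open?
A: Let's think step by step."""

print(cot_prompt)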
5 Augmented LLM
LLMs are constrained by their training data. If a user’s prompt requires domain-
specific knowledge, such as data related to a company’s Service Level Agreement
(SLA), LLMs may not deliver accurate responses in such cases.
While ICL techniques require users to provide examples in the prompt, such
as few-shot learning, augmented LLMs incorporate methods that access external
resources and tools to improve the model’s performance. This augmentation can
be integrated into LLMs either during the training or inference stage [62]. In this
section, we explore several categories of augmented LLMs.
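As a minimal, hedged sketch of this idea, the example below retrieves the most relevant passage from a small in-memory document set using TF-IDF similarity and prepends it to the user query before it would be sent to an LLM; the documents, the query, and the use of TF-IDF in place of a learned retriever are all illustrative assumptions.

# Hedged retrieval-augmentation sketch, assuming scikit-learn is installed;
# the documents and query are invented, and TF-IDF stands in for a real retriever.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Standard SLA: severity-1 incidents are acknowledged within 15 minutes.",
    "Deployment guide: production releases happen every Tuesday.",
    "Standard SLA: the monthly uptime target is 99.9 percent.",
]
query = "What is the uptime target in our SLA?"

vectors = TfidfVectorizer().fit_transform(documents + [query])
scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
best_passage = documents[scores.argmax()]

# The retrieved passage is prepended to the prompt before calling the LLM.
augmented_prompt = f"Context: {best_passage}\n\nQuestion: {query}\nAnswer:"
print(augmented_prompt)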
The LLMs used for SE tasks often rely on open-source repositories for training and
fine-tuning [12]. Before fine-tuning the model for a specific task, there is a pre-
training step on textual data to enhance the language understanding capabilities of
the model [91]. Different studies use pre-trained LLMs either for inferring a task or
fine-tuning the pre-trained model for specific downstream tasks [26, 74]. Platforms
such as GitHub and StackOverflow provide vast code and textual data, serving as
resources for tuning LLMs for SE tasks.
Several benchmark datasets are commonly used in evaluating LLMs for diverse
SE tasks. Among them, we can point to the CodexGLUE [57, 58] dataset, collected for
evaluating the general language understanding of LLMs for different code-related
tasks. This benchmark includes 14 datasets across 10 different code-related tasks
such as clone detection, defect detection, code completion, code translation, and
code summarization. For the test case generation task, datasets like ATLAS [97]
and Methods2Test [89] are employed to fine-tune and evaluate LLMs for generating
test cases in Java. The PROMISE NFR dataset [44], on the other hand, is used in
studies leveraging LLMs for classifying project requirements.
Datasets like HumanEval [12] and APPS [39] are also commonly used for
evaluating LLMs in tasks requiring code generation, but they often incorporate pro-
gramming competition tasks. In contrast, CoderEval [113] is a benchmark dataset
that collects programming tasks from more real-world programming scenarios.
Regarding the evaluation metrics, given the diversity of SE tasks, a single
evaluation metric may not adequately capture the performance of LLMs for different
tasks. Studies typically employ a range of metrics based on the specific problem
types. Metrics like F1-score or precision find application in tasks such as code
classification [40]. For evaluating the generative capability of LLMs, metrics such
as BLEU [96], CodeBLEU [78], Exact Match (EM) [90], and Pass@k [12] are
commonly used. Metrics like BLEU score and EM are more useful for tasks
such as code review or code summarization because the output of the model
is textual. In contrast, code generation and test generation tasks demand accuracy that extends beyond matching the ground truth. An accurate output for these types of tasks should compile, be effective, and implement the requirements outlined in the task
description. Thus, metrics like Pass@k, which execute code on certain test cases,
are more practical in these scenarios. In tasks like program repair, the evaluation
metric also pertains to the correctness of the code after bug repair [45].
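To make the Pass@k metric concrete, the sketch below implements the commonly used unbiased estimator described in [12], where n candidate solutions are sampled per task and c of them pass the task's test cases; the example numbers are illustrative.

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    solutions drawn from n samples (c of which are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 10 samples per task, 3 of which pass the task's tests.
print(round(pass_at_k(n=10, c=3, k=1), 3))   # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))   # about 0.917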
Furthermore, different quality metrics in SE can be employed to evaluate LLM
output across different SE tasks. Metrics such as cyclomatic complexity [18], test
coverage [79], mutation score [19], code/test smells [84], and vulnerabilities [60, 72]
serve as benchmark metrics for assessing the quality of outputs generated by LLMs
in diverse SE tasks.
7 Tools or Libraries
Various libraries are available for the training, tuning, and inference of LLMs,
including Transformers [101], DeepSpeed [77], BMTrain [7], PyTorch [71], and
TensorFlow [1]. Additionally, there are tools designed to facilitate the process of
prompting LLMs and building applications with them.
LangChain [50] is a framework tailored for developing applications that leverage
LLMs. The primary concept behind this tool involves facilitating the chaining of various components around an LLM to build more advanced use cases, such as
a Chatbot. LangChain offers diverse prompt templates, short-term and long-term
memory access for retrieval setups, and interaction capabilities with different LLMs.
AutoGen [103, 104] is another framework that empowers the development of
LLM applications by employing multiple agents capable of communicating with
each other to solve different tasks. AutoGen features customizable agents with an LLM at their core and also allows human participation and the incorporation of various
tools. The framework also supports different prompt templates.
Furthermore, Guidance [36] is a tool that enhances the effective use of various
ICL prompts, such as CoT, and simplifies the overall structure for different prompt
templates.
The GitHub repository Parameter-Efficient Fine-Tuning (PEFT) [61] provides
various efficient tuning approaches for adapting Pre-trained LLMs to downstream
applications without fine-tuning all the model’s parameters. This repository includes
LoRA [41]/AdaLoRA [116] and Prefix Tuning [52]. Additionally, it supports
numerous models such as GPT-2 and LLaMA.
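A hedged sketch of this kind of parameter-efficient adaptation with LoRA is shown below; the small GPT-2 base model and the specific LoRA hyperparameters are illustrative choices rather than recommendations.

# Hedged LoRA sketch, assuming the `peft` and `transformers` packages;
# `gpt2` is only a small illustrative base model.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8,
                         lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(base_model, lora_config)

# Only the small LoRA adapter matrices are trainable; base weights stay frozen.
model.print_trainable_parameters()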
Leveraging LLMs for SE tasks poses several challenges and limitations. One of
the challenges is the demand for high-quality data for effective training and tuning
of LLMs for different SE tasks. Additionally, the training and tuning processes
are resource-intensive and require significant time and computational cost. There
is also a lack of effective resource-efficient adaptation methods for LLMs. While
the literature has introduced numerous efficient tuning methods as mentioned in
Sect. 3.4, the majority of these techniques have been evaluated on small-scale pre-
trained language models rather than LLMs. As of now, there remains a notable
absence of comprehensive research examining the impact of various efficient tuning
methods on large-scale language models across diverse settings or tasks.
Various techniques have been proposed on the prompt side to adapt models for
new, unseen tasks, such as ICL. However, one of the limitations of these techniques
is the restricted amount of content that can be incorporated into the prompt because
of the context window size of LLMs.
On the other hand, LLMs are constrained by the information and knowledge in their training data, which limits their adaptability to evolving scenarios. To overcome this limitation, various techniques, like Retrieval-Augmented Generation (RAG), have been proposed to augment LLMs with new information relevant to the prompt, either during tuning or at inference.
LLMs may also hallucinate, producing outputs that are plausible but incorrect. Evaluation metrics such as the correct ratio for code
generation tasks can aid in detecting hallucinations by identifying code that fails in
certain test cases. However, LLMs may occasionally overlook specifications in the
task description, which may not be detected with test cases and need human experts
to filter them out.
Another limitation pertains to the fact that the outputs of LLMs are sometimes
buggy, inaccurate, biased, or harmful. It is necessary to filter these outputs before
presenting them to end users. Studies have employed the RLHF technique to
enhance the model’s output by rewarding good-quality responses. However, a
notable limitation is associated with the efforts and time required for learning a
reward model based on human feedback.
Moreover, numerous quality evaluation metrics in SE require the execution of the
code generated by LLMs, which poses challenges when evaluating code that is not self-
contained and has dependencies. Exploring the training of a model that can predict
code quality could be an interesting direction to address this limitation. Leveraging
LLMs as a tool to enhance their own output, such as fixing bugs or generating
test cases to evaluate the generated code, can also be beneficial in addressing this
limitation.
LLMs also face challenges when addressing complex SE tasks. While these models perform well on benchmark datasets with fewer dependencies that share the same distribution as their training data, they may face challenges
in scalability and robustness when deployed in real-world environments, such as
software projects. The scalability challenge arises from the size and computational
cost of these models, making their deployment and real-time usage challenging. For
instance, correctly completing a single programming task may require considering
the contexts of various dependencies. As for robustness, the issue lies in the presence
of diverse data or prompts in software projects that fall out of the distribution of
the LLMs’ training data, impacting their performance in real-world environments
compared to their performance on benchmark datasets [28].
Another key concern arises from the memorization issue in LLMs, where models
generate entire sequences of tokens verbatim from their training data [6]. This
problem is triggered, for example, when the prompt precisely matches the content in
the model’s training data. Consequently, the model generates the sequence of tokens
from its training data in the output to complete the prompt rather than generalizing
it. Many benchmark datasets in SE are sourced from GitHub or StackOverflow and
are already part of the training data for LLMs. Using these benchmarks to evaluate
LLMs can impact the quality of evaluation due to the memorization issue. There is a
lack of more comprehensive datasets that are not a part of the training data of LLMs
to evaluate their performance for different SE tasks. Therefore, another potential
future direction could involve constructing benchmark datasets beyond HumanEval
to evaluate LLMs for various SE tasks.
References
1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S.,
Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. In:
12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp.
265–283 (2016)
2. Agarap, A.F.: Deep learning using rectified linear units (ReLU). Preprint (2018).
arXiv:1803.08375
3. Ahmed, T., Devanbu, P.: Few-shot training llms for project-specific code-summarization.
In: Proceedings of the 37th IEEE/ACM International Conference on Automated Software
Engineering, pp. 1–5 (2022)
4. Ali, M., Fromm, M., Thellmann, K., Rutmann, R., Lübbering, M., Leveling, J., Klug, K.,
Ebert, J., Doll, N., Buschhoff, J.S., et al.: Tokenizer choice for llm training: Negligible or
crucial? Preprint (2023). arXiv:2310.08754
5. Bacciu, A., Cocunasu, F., Siciliano, F., Silvestri, F., Tonellotto, N., Trappolini, G.: Rraml:
Reinforced retrieval augmented machine learning. Preprint (2023). arXiv:2307.12798
6. Biderman, S., Prashanth, U.S., Sutawika, L., Schoelkopf, H., Anthony, Q., Purohit, S.,
Raf, E.: Emergent and predictable memorization in large language models. Preprint (2023).
arXiv:2304.11158
7. Bmtrain: Efficient training for big models (2021). https://github.com/OpenBMB/BMTrain
8. Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van Den Driess-
che, G.B., Lespiau, J.B., Damoc, B., Clark, A., et al.: Improving language models by
retrieving from trillions of tokens. In: International Conference on Machine Learning, pp.
2206–2240. PMLR (2022)
9. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Adv. Neural
Inf. Process. Syst. 33, 1877–1901 (2020)
10. Cao, J., Li, M., Wen, M., Cheung, S.c.: A study on prompt design, advantages and limitations
of chatgpt for deep learning program repair. Preprint (2023). arXiv:2304.08191
11. Chang, Y., Wang, X., Wang, J., Wu, Y., Zhu, K., Chen, H., Yang, L., Yi, X., Wang, C., Wang,
Y., et al.: A survey on evaluation of large language models. Preprint (2023). arXiv:2307.03109
12. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards, H., Burda, Y.,
Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code. Preprint
(2021). arXiv:2107.03374
13. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P.,
Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling language modeling with
pathways. Preprint (2022). arXiv:2204.02311
14. Ciborowska, A., Damevski, K.: Fast changeset-based bug localization with bert. In: Proceed-
ings of the 44th International Conference on Software Engineering, pp. 946–957 (2022)
15. Ciniselli, M., Cooper, N., Pascarella, L., Poshyvanyk, D., Di Penta, M., Bavota, G.: An
empirical study on the usage of bert models for code completion. In: 2021 IEEE/ACM 18th
International Conference on Mining Software Repositories (MSR), pp. 108–119. IEEE (2021)
16. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: Electra: Pre-training text encoders as
discriminators rather than generators. Preprint (2020). arXiv:2003.10555
17. Cohen, E., Beck, C.: Empirical analysis of beam search performance degradation in neural
sequence models. In: International Conference on Machine Learning. pp. 1290–1299. PMLR
(2019)
18. Dakhel, A.M., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M.C., Jiang, Z.M.J.:
Github Copilot AI pair programmer: Asset or liability? J. Syst. Software 203, 111734 (2023)
19. Dakhel, A.M., Nikanjam, A., Majdinasab, V., Khomh, F., Desmarais, M.C.: Effective test
generation using pre-trained large language models and mutation testing (2023). https://arxiv.
org/abs/2308.16557
20. Dang, V., Bendersky, M., Croft, W.B.: Two-stage learning to rank for information retrieval. In:
Advances in Information Retrieval: 35th European Conference on IR Research, ECIR 2013,
Moscow, Russia, March 24–27, 2013. Proceedings 35, pp. 423–434. Springer (2013)
21. De Boom, C., Van Canneyt, S., Bohez, S., Demeester, T., Dhoedt, B.: Learning semantic
similarity for very short texts. In: 2015 IEEE International Conference on Data Mining
Workshop (ICDMW), pp. 1229–1234. IEEE (2015)
22. Dettmers, T., Lewis, M., Belkada, Y., Zettlemoyer, L.: GPT3.int8(): 8-bit matrix multiplica-
tion for transformers at scale. Adv. Neural Inf. Process. Syst. 35, 30318–30332 (2022)
23. Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: Qlora: Efficient finetuning of
quantized llms. Preprint (2023). arXiv:2305.14314
24. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional
transformers for language understanding. Preprint (2018). arXiv:1810.04805
25. Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston,
J.: Chain-of-verification reduces hallucination in large language models. Preprint (2023).
arXiv:2309.11495
26. Dinella, E., Ryan, G., Mytkowicz, T., Lahiri, S.K.: Toga: A neural method for test oracle
generation. In: Proceedings of the 44th International Conference on Software Engineering,
pp. 2130–2141 (2022)
27. Dong, Q., Li, L., Dai, D., Zheng, C., Wu, Z., Chang, B., Sun, X., Xu, J., Sui, Z.: A survey for
in-context learning. Preprint (2022). arXiv:2301.00234
28. Du, M., He, F., Zou, N., Tao, D., Hu, X.: Shortcut learning of large language models in natural
language understanding: A survey. Preprint (2022). arXiv:2208.11857
29. Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., Tang, J.: Glm: General language model
pretraining with autoregressive blank infilling. Preprint (2021). arXiv:2103.10360
30. Dubey, S.R., Singh, S.K., Chaudhuri, B.B.: Activation functions in deep learning: A
comprehensive survey and benchmark. Neurocomputing 503, 92–108 (2022)
31. Floridi, L., Chiriatti, M.: Gpt-3: Its nature, scope, limits, and consequences. Minds Mach. 30,
681–694 (2020)
32. Fu, M., Tantithamthavorn, C.: Linevul: A transformer-based line-level vulnerability predic-
tion. In: Proceedings of the 19th International Conference on Mining Software Repositories,
pp. 608–620 (2022)
33. Gao, Z., Feng, A., Song, X., Wu, X.: Target-dependent sentiment classification with bert.
IEEE Access 7, 154290–154299 (2019)
34. Gim, I., Chen, G., Lee, S.s., Sarda, N., Khandelwal, A., Zhong, L.: Prompt cache: Modular
attention reuse for low-latency inference. Preprint (2023). arXiv:2311.04934
35. Goyal, T., Li, J.J., Durrett, G.: News summarization and evaluation in the era of gpt-3. Preprint
(2022). arXiv:2209.12356
36. Guidance: A programming paradigm to conventional prompting and chaining (2023). https://
github.com/guidance-ai/guidance
37. Guo, Y., Zheng, Y., Tan, M., Chen, Q., Li, Z., Chen, J., Zhao, P., Huang, J.: Towards accurate
and compact architectures via neural architecture transformer. IEEE Trans. Pattern Anal.
Mach. Intell. 44(10), 6501–6516 (2021)
38. He, H., Zhang, H., Roth, D.: Rethinking with retrieval: Faithful large language model
inference. Preprint (2022). arXiv:2301.00303
39. Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C.,
Puranik, S., He, H., Song, D., et al.: Measuring coding challenge competence with apps.
Preprint (2021). arXiv:2105.09938
40. Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., Wang,
H.: Large language models for software engineering: A systematic literature review. Preprint
(2023). arXiv:2308.10620
41. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA:
Low-rank adaptation of large language models. In: International Conference on Learning
Representations (2022). https://openreview.net/forum?id=nZeVKeeFYf9
42. Ippolito, D., Kriz, R., Kustikova, M., Sedoc, J., Callison-Burch, C.: Comparison of diverse
decoding methods from conditional language models. Preprint (2019). arXiv:1906.06362
43. Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J.,
Joulin, A., Riedel, S., Grave, E.: Few-shot learning with retrieval augmented language models.
Preprint (2022). arXiv:2208.03299
44. Cleland-Huang, J., Mazrouee, S., Huang, L., Port, D.: The promise repository of empirical
software engineering data (2007). https://zenodo.org/records/268542
45. Jiang, N., Liu, K., Lutellier, T., Tan, L.: Impact of code language models on automated
program repair. Preprint (2023). arXiv:2302.05020
46. Jiang, X., Dong, Y., Wang, L., Shang, Q., Li, G.: Self-planning code generation with large
language model. Preprint (2023). arXiv:2303.06689
47. Kheiri, K., Karimi, H.: Sentimentgpt: Exploiting gpt for advanced sentiment analysis and its
departure from current machine learning. Preprint (2023). arXiv:2307.10234
48. Kudo, T.: Subword regularization: Improving neural network translation models with multiple
subword candidates. Preprint (2018). arXiv:1804.10959
49. Kudo, T., Richardson, J.: Sentencepiece: A simple and language independent subword
tokenizer and detokenizer for neural text processing. Preprint (2018). arXiv:1808.06226
50. Langchain: A primer on developing llm apps fast (2023). https://github.com/langchain-ai/
langchain
51. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M.,
Yih, W.t., Rocktäschel, T., et al.: Retrieval-augmented generation for knowledge-intensive nlp
tasks. Adv. Neural Inf. Process. Syst. 33, 9459–9474 (2020)
52. Li, X.L., Liang, P.: Prefix-tuning: Optimizing continuous prompts for generation. Preprint
(2021). arxiv:2101.00190
53. Li, X., Gong, Y., Shen, Y., Qiu, X., Zhang, H., Yao, B., Qi, W., Jiang, D., Chen, W., Duan, N.:
Coderetriever: A large scale contrastive pre-training method for code search. In: Proceedings
of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2898–
2910 (2022)
54. Li, Z., Lu, S., Guo, D., Duan, N., Jannu, S., Jenks, G., Majumder, D., Green, J., Svyatkovskiy,
A., Fu, S., et al.: Automating code review activities by large-scale pre-training. In: Proceed-
ings of the 30th ACM Joint European Software Engineering Conference and Symposium on
the Foundations of Software Engineering, pp. 1035–1047 (2022)
55. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman,
J., Sutskever, I., Cobbe, K.: Let’s verify step by step. Preprint (2023). arXiv:2305.20050
56. Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi,
R., Chandra, V.: Llm-qat: Data-free quantization aware training for large language models.
Preprint (2023). arXiv:2305.17888
57. Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., Clement, C., Drain, et al.:
Codexglue: A machine learning benchmark dataset for code understanding and generation
(2021). https://github.com/microsoft/CodeXGLUE
58. Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., Clement, C., Drain, D.,
Jiang, D., Tang, D., et al.: Codexglue: A machine learning benchmark dataset for code
understanding and generation. Preprint (2021). arXiv:2102.04664
59. Ma, X., Gong, Y., He, P., Zhao, H., Duan, N.: Query rewriting for retrieval-augmented large
language models. Preprint (2023). arXiv:2305.14283
60. Majdinasab, V., Bishop, M.J., Rasheed, S., Moradidakhel, A., Tahir, A., Khomh, F.: Assessing
the security of github copilot generated code—a targeted replication study. Preprint (2023).
arXiv:2311.11177
61. Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S., Bossan, B.: Peft: State-of-the-art
parameter-efficient fine-tuning methods (2022). https://github.com/huggingface/peft
62. Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B.,
Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., et al.: Augmented language models: a survey.
Preprint (2023). arXiv:2302.07842
63. Mielke, S.J., Alyafeai, Z., Salesky, E., Raffel, C., Dey, M., Gallé, M., Raja, A., Si, C., Lee,
W.Y., Sagot, B., et al.: Between words and characters: a brief history of open-vocabulary
modeling and tokenization in nlp. Preprint (2021). arXiv:2112.10508
64. Min, B., Ross, H., Sulem, E., Veyseh, A.P.B., Nguyen, T.H., Sainz, O., Agirre, E., Heintz,
I., Roth, D.: Recent advances in natural language processing via large pre-trained language
models: A survey. ACM Comput. Surv. 56(2), 1–40 (2023)
65. Mitra, B., Craswell, N.: Neural models for information retrieval. Preprint (2017).
arXiv:1705.01509
66. Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju,
V., Saunders, W., et al.: Webgpt: Browser-assisted question-answering with human feedback.
Preprint (2021). arXiv:2112.09332
67. Nashid, N., Sintaha, M., Mesbah, A.: Retrieval-based prompt selection for code-related few-
shot learning. In: Proceedings of the 45th International Conference on Software Engineering
(ICSE’23) (2023)
68. Naveed, H., Khan, A.U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Barnes, N., Mian, A.: A
comprehensive overview of large language models. Preprint (2023). arXiv:2307.06435
69. Pan, C., Lu, M., Xu, B.: An empirical study on software defect prediction using codebert
model. Appl. Sci. 11(11), 4793 (2021)
70. Pan, R., Ibrahimzada, A.R., Krishna, R., Sankar, D., Wassi, L.P., Merler, M., Sobolev, B.,
Pavuluri, R., Sinha, S., Jabbarvand, R.: Understanding the effectiveness of large language
models in code translation. Preprint (2023). arXiv:2308.03109
71. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z.,
Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep
learning library. Adv. Neural Inf. Process. Syst. 32, (2019). arXiv:1912.01703
72. Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., Karri, R.: Asleep at the keyboard? assessing
the security of github copilot’s code contributions. In: 2022 IEEE Symposium on Security and
Privacy (SP), pp. 754–768. IEEE (2022)
73. Pinaya, W.H., Graham, M.S., Kerfoot, E., Tudosiu, P.D., Dafflon, J., Fernandez, V., Sanchez,
P., Wolleb, J., da Costa, P.F., Patel, A., et al.: Generative ai for medical imaging: extending
the monai framework. Preprint (2023). arXiv:2307.15208
74. Prenner, J.A., Babii, H., Robbes, R.: Can openai’s codex fix bugs? an evaluation on quixbugs.
In: Proceedings of the Third International Workshop on Automated Program Repair, pp. 69–
75 (2022)
75. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu,
P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach.
Learn. Res. 21(1), 5485–5551 (2020)
76. Rahman, K., Ghani, A., Alzahrani, A., Tariq, M.U., Rahman, A.U.: Pre-trained model-based
NFR classification: Overcoming limited data challenges. IEEE Access 11, 81787–81802
(2023)
77. Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: Deepspeed: System optimizations enable
training deep learning models with over 100 billion parameters. In: Proceedings of the 26th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–
3506 (2020)
78. Ren, S., Guo, D., Lu, S., Zhou, L., Liu, S., Tang, D., Sundaresan, N., Zhou, M., Blanco,
A., Ma, S.: Codebleu: a method for automatic evaluation of code synthesis. Preprint (2020).
arXiv:2009.10297
79. Schäfer, M., Nadi, S., Eghbali, A., Tip, F.: Adaptive test generation using a large language
model. Preprint (2023). arXiv:2302.06527
80. Scholak, T., Schucher, N., Bahdanau, D.: Picard: Parsing incrementally for constrained auto-
regressive decoding from language models. Preprint (2021). arXiv:2109.05093
81. Schuster, M., Nakajima, K.: Japanese and korean voice search. In: 2012 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152. IEEE
(2012)
82. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword
units. Preprint (2015). arXiv:1508.07909
83. Shuster, K., Xu, J., Komeili, M., Ju, D., Smith, E.M., Roller, S., Ung, M., Chen, M., Arora,
K., Lane, J., et al.: Blenderbot 3: a deployed conversational agent that continually learns to
responsibly engage. Preprint (2022). arXiv:2208.03188
84. Steenhoek, B., Tufano, M., Sundaresan, N., Svyatkovskiy, A.: Reinforcement learning from
automatic feedback for high-quality unit test generation. Preprint (2023). arXiv:2310.02368
85. Sun, T., Shao, Y., Qian, H., Huang, X., Qiu, X.: Black-box tuning for language-model-as-a-
service. In: International Conference on Machine Learning, pp. 20841–20855. PMLR (2022)
86. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T.B.:
Stanford alpaca: An instruction-following llama model (2023). https://github.com/tatsu-lab/
stanford_alpaca
87. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra,
S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models.
Preprint (2023). arXiv:2307.09288
88. Trivedi, H., Balasubramanian, N., Khot, T., Sabharwal, A.: Interleaving retrieval with
chain-of-thought reasoning for knowledge-intensive multi-step questions. Preprint (2022).
arXiv:2212.10509
89. Tufano, M., Deng, S.K., Sundaresan, N., Svyatkovskiy, A.: Methods2test: A dataset of focal
methods mapped to test cases. In: Proceedings of the 19th International Conference on Mining
Software Repositories, pp. 299–303 (2022)
90. Tufano, M., Drain, D., Svyatkovskiy, A., Deng, S.K., Sundaresan, N.: Unit test case
generation with transformers and focal context. Preprint (2020). arXiv:2009.05617
91. Tufano, M., Drain, D., Svyatkovskiy, A., Sundaresan, N.: Generating accurate assert state-
ments for unit test cases using pretrained transformers. In: Proceedings of the 3rd ACM/IEEE
International Conference on Automation of Software Test, pp. 54–64 (2022)
92. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 1–11 (2017)
93. Vijayakumar, A.K., Cogswell, M., Selvaraju, R.R., Sun, Q., Lee, S., Crandall, D., Batra,
D.: Diverse beam search: Decoding diverse solutions from neural sequence models. Preprint
(2016). arXiv:1610.02424
94. Wang, B., Xie, Q., Pei, J., Chen, Z., Tiwari, P., Li, Z., Fu, J.: Pre-trained language models in
biomedical domain: A systematic survey. ACM Comput. Surv. 56(3), 1–52 (2023)
95. Wang, Y., Ma, X., Chen, W.: Augmenting black-box llms with medical textbooks for clinical
question answering. Preprint (2023). arXiv:2309.02233
96. Wang, Y., Wang, W., Joty, S., Hoi, S.C.: Codet5: Identifier-aware unified pre-trained encoder-
decoder models for code understanding and generation. Preprint (2021). arXiv:2109.00859
97. Watson, C., Tufano, M., Moran, K., Bavota, G., Poshyvanyk, D.: On learning meaningful
assert statements for unit test cases. In: Proceedings of the ACM/IEEE 42nd International
Conference on Software Engineering, pp. 1398–1409 (2020)
98. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.:
Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf.
Process. Syst. 35, 24824–24837 (2022)
99. White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith,
J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with chatgpt.
Preprint (2023). arXiv:2302.11382
100. Winston, P.H.: Learning and reasoning by analogy. Commun. ACM 23(12), 689–703 (1980)
101. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T.,
Louf, R., Funtowicz, M., et al.: Transformers: State-of-the-art natural language processing. In:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing:
System Demonstrations, pp. 38–45 (2020)
102. Wong, D., Kothig, A., Lam, P.: Exploring the verifiability of code generated by github copilot.
Preprint (2022). arXiv:2209.01766
103. Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., Wang,
C.: Autogen: Enabling next-gen llm applications via multi-agent conversation framework.
Preprint (2023). arXiv:2308.08155
104. Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., Wang,
C.: Autogen: Enabling next-gen LLM applications via multi-agent conversation framework
(2023). https://github.com/microsoft/autogen
105. Xia, C.S., Wei, Y., Zhang, L.: Automated program repair in the era of large pre-trained lan-
guage models. In: Proceedings of the 45th International Conference on Software Engineering
(ICSE 2023). Association for Computing Machinery (2023)
106. Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., Han, S.: Smoothquant: Accurate and
efficient post-training quantization for large language models. In: International Conference
on Machine Learning, pp. 38087–38099. PMLR (2023)
107. Xie, D., Yoo, B., Jiang, N., Kim, M., Tan, L., Zhang, X., Lee, J.S.: Impact of large language
models on generating software specifications. Preprint (2023). arXiv:2306.03324
108. Xu, H., Liu, B., Shu, L., Yu, P.S.: Bert post-training for review reading comprehension and
aspect-based sentiment analysis. Preprint (2019). arXiv:1904.02232
109. Yan, H., Deng, B., Li, X., Qiu, X.: Tener: adapting transformer encoder for named entity
recognition. Preprint (2019). arXiv:1911.04474
110. Yang, K., Tian, Y., Peng, N., Klein, D.: Re3: Generating longer stories with recursive
reprompting and revision. Preprint (2022). arXiv:2210.06774
111. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: React: Synergizing
reasoning and acting in language models. Preprint (2022). arXiv:2210.03629
112. Yao, Z., Yazdani Aminabadi, R., Zhang, M., Wu, X., Li, C., He, Y.: Zeroquant: Efficient and
affordable post-training quantization for large-scale transformers. Adv. Neural Inf. Process.
Syst. 35, 27168–27183 (2022)
113. Yu, H., Shen, B., Ran, D., Zhang, J., Zhang, Q., Ma, Y., Liang, G., Li, Y., Xie, T., Wang, Q.:
Codereval: A benchmark of pragmatic code generation with generative pre-trained models.
Preprint (2023). arXiv:2302.00288
114. Zan, D., Chen, B., Zhang, F., Lu, D., Wu, B., Guan, B., Yongji, W., Lou, J.G.: Large
language models meet nl2code: A survey. In: Proceedings of the 61st Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pp. 7443–7464 (2023)
115. Zhang, J., Luan, H., Sun, M., Zhai, F., Xu, J., Zhang, M., Liu, Y.: Improving the transformer
translation model with document-level context. Preprint (2018). arXiv:1810.03581
116. Zhang, Q., Chen, M., Bukharin, A., He, P., Cheng, Y., Chen, W., Zhao, T.: Adaptive budget
allocation for parameter-efficient fine-tuning. Preprint (2023). arXiv:2303.10512
117. Zhang, Q., Fang, C., Xie, Y., Zhang, Y., Yang, Y., Sun, W., Yu, S., Chen, Z.: A survey on large
language models for software engineering. Preprint (2023). arXiv:2312.15223
118. Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J.,
Dong, Z., et al.: A survey of large language models. Preprint (2023). arXiv:2303.18223
119. Zhou, Y., Muresanu, A.I., Han, Z., Paster, K., Pitis, S., Chan, H., Ba, J.: Large language
models are human-level prompt engineers. Preprint (2022). arXiv:2211.01910
120. Ziegler, D.M., Stiennon, N., Wu, J., Brown, T.B., Radford, A., Amodei, D., Christiano,
P., Irving, G.: Fine-tuning language models from human preferences. Preprint (2019).
arXiv:1909.08593
Part II
Patterns and Tools for the Adoption of
Generative AI in Software Engineering
Comparing Proficiency of ChatGPT and
Bard in Software Development
Dae-Kyoo Kim
1 Introduction
2 Related Works
In this section, we give an overview of the existing work that evaluates the
performance of Gen-LBAI models in software development.
Dakhel [1] conducted an evaluation of GitHub Copilot as an AI pair programmer,
examining its code quality compared to human-generated code. By assessing
the solutions generated by Copilot for fundamental programming problems, they
reported certain limitations (e.g., incorrectness, complexity) of Copilot and offered
some insights on how developers can leverage its capabilities.
Mastropaolo et al. [6] studied the influence of varying natural language descrip-
tions of a function on Copilot’s recommendations. They used Java methods with
corresponding descriptions and paraphrasing techniques to generate alternative
descriptions. The paraphrases produced by the paraphrasing techniques were used
for code predictions, and the differences in the recommended code were compared.
Their study found that paraphrasing led to different code recommendations in
approximately 46% of cases, affecting the accuracy of predictions.
Peng et al. [7] conducted a controlled experiment to examine the impact of
Copilot on productivity. Two groups of programmers were tasked with implement-
ing an HTTP server in JavaScript where one group had access to Copilot and
the other group did not. They reported that the group utilizing Copilot completed
the task 55.8% faster, suggesting the potential of AI-pair programmers to enhance
programming productivity.
Wermelinger [11] investigated the performance of Copilot in generating code,
tests, and explanations in supporting students to solve CS1 problems and compared
it to Davinci, which is a Codex model, in terms of correctness, diversity, and inter-
active guidance toward correct solutions. They reported that Copilot demonstrated
less effectiveness than Davinci on correctness and diversity.
Sobania et al. [8] evaluated Copilot’s performance on program synthesis using a
set of benchmark problems and compared the results with those of GP reported in
the literature. They examined the programs produced by Copilot and GP using the
Copilot Extension for Visual Studio Code for correctness. Their study found that
Copilot and GP exhibited similar performance on the benchmark problems.
3 Evaluation
In this section, we evaluate and compare the capabilities of ChatGPT and Bard on
software development using the TORS system. Figure 1 illustrates the approach.
Initially, a description of TORS is provided. The description is used as a base for
requirements analysis. In this phase, the abilities of ChatGPT and Bard to identify
and clarify ambiguities and define functional and non-functional requirements are
evaluated. The output of requirements analysis is then used as input for design
modeling, where ChatGPT and Bard are evaluated for their capabilities to make
design decisions such as operation assignment, navigabilities, and data types. The
resulting design models are implemented in Java. In this phase, ChatGPT and Bard
are evaluated for their abilities to generate code and rectify errors. The output
of each phase is evaluated based on object-oriented principles [5] by a software
researcher with 32 years of experience in software engineering. For a fair comparison,
both models were given the same instructions in the experiments.
3.1 Requirements Analysis
In this phase, both models were evaluated for their proficiency in comprehending
and analyzing the problem description of the TORS system. The evaluation focuses
on identifying ambiguities within the description, rectifying them to enhance
the clarity of the problem statement, extracting functional and non-functional
requirements from the description, and generating comprehensive and detailed use
case specifications. As an input to the models, the description of TORS, as shown
in Fig. 2, was given.
Identifying Ambiguities In this task, ChatGPT and Bard were tasked to identify
ambiguities in the TORS description. The following prompt was given to both
models:
– Prompt: “For the given description, identify ambiguities.”
Table 1 shows their outputs. In the outputs, it is observed that ChatGPT
focuses more on the technical aspects and functionality (e.g., built-in network,
automation level), while Bard considers more user experience and interactions with
the system (e.g., user interaction methods in viewing tour information). ChatGPT’s
output can be more helpful from a system developer perspective to understand the
implementation aspects of TORS, while Bard’s output is more insightful from a user
perspective. These observations suggest that their combined use can lead to a more
comprehensive description of the system, covering both technical and user-oriented
aspects.
Clarifying Ambiguities After identifying ambiguities, both models were instructed
to clarify the identified ambiguities. The following prompt was given to the models:
– Prompt: “Clarify the identified ambiguities in the output of the previous prompt.”
Table 2 shows the comparison of the produced clarifications. The clarifications
by ChatGPT are well aligned with the identified ambiguities, which are from a
developer perspective, including built-in network referring to a Web interface, the
process being fully automated without requiring employee involvement, built-in
email client and CRM in the system, and ways to access the system. Bard also
effectively addresses the ambiguities that are from a user perspective, including
types of tours and ways to access tour information, make or cancel a reservation,
send feedback, manage customer information, and access the system.
Revising Descriptions Based on the produced clarifications, both models were
instructed to revise the original description of TORS by incorporating the clarifi-
cations. The following prompt was given to the models:
– Prompt: “Revise the original description of TORS by reflecting the clarifications
in the output of the previous prompt.”
Table 3 shows the revised descriptions. ChatGPT effectively integrated three out
of five clarifications, addressing the built-in network, email feedback, and CRM.
However, it still falls short in properly incorporating the clarifications regarding
human intervention and the security measure. Bard also struggled with integration.
Only three out of seven clarifications were properly incorporated. It misses the
clarification on types of tours in the revision and only partially integrates the
In ChatGPT’s output, the FRs are deemed comprehensive, but there are some
overlaps. For example, Maintaining tour information overlaps with Managing
customers and tours, and Providing reservation about available tours overlaps
with Making a reservation on a tour. These overlaps may have occurred due to
the parsing of verbs in the description. Regarding NFRs, ChatGPT seems to have
expanded beyond the scope of the original description. The first three qualities (user-
friendliness, accessibility, persistency) align with the description, while the rest
(availability, reliability, scalability, compatibility, security, efficiency, and tolerance)
are more general in nature and not explicitly stated in the description. On the
other hand, the FRs produced by Bard do not exhibit any overlaps. However,
there are some requirements that are missing. For instance, it fails to capture the
requirement of storing feedback in the database and the requirement related to login
and authentication. In terms of NFRs, all the identified qualities appear to be of a
general nature, such as security, ease of use, reliability, scalability, and availability,
which are not specifically stated in the original description. Although ChatGPT had
some overlaps in its outputs, overall, they are more complete compared to Bard’s
outputs, especially for FRs.
Generating Use Case Specifications In this task, ChatGPT and Bard were evaluated
for their capabilities to generate use case specifications. For the requirement
of making a reservation, they were instructed to produce a detailed use case
specification. The following prompt was given to the models:
– Prompt: “For the “Making a reservation on a tour” requirement, generate a
detailed use case specification.”
Table 4 presents the outputs. The use case specification produced by ChatGPT
demonstrates a well-structured format with adequate details, following a commonly
used form [5]. The preconditions and postconditions align well with the TORS
description, and the basic flow effectively depicts the sequence of interactions,
including additional details that appear to be inferred from ChatGPT’s domain
knowledge such as filling out the reservation form in Step 6 (which was not explic-
itly stated in the original TORS description but added in the revised description),
verifying tour availability in Step 7, sending a confirmation in Step 8, and updating
information in the database in Step 9. Although these additional details are subject to
confirmation by the customer, they contribute to the maturity and richness of the
requirements, potentially aiding in the identification of any missing requirements.
Alternative flows also provide additional details inferred by ChatGPT, especially for
scenarios where the selected tour is not available and payment details are invalid.
However, they are somewhat brief and miss steps. Postconditions, which describe
any state changes made by the use case, appear reasonable, except the one regarding
the customer receiving a reservation confirmation email, which is not necessarily a
state change. On the other hand, the specification lacks information on the scope and
level of the system under development. The use case specification produced by Bard
presents a similar structure to ChatGPT’s, adhering to the commonly used form.
Compared to ChatGPT’s output, Bard’s preconditions cover only a single condition
similar to the one in ChatGPT’s specification. The basic flow generally aligns with
the TORS description, incorporating some details inferred from Bard’s domain
knowledge such as the payment process in Steps 5 and 6. However, the sequence
lacks interactive system responses, which are crucial from both a user perspective
and a development standpoint. On the other hand, the alternative flows appear to
be more comprehensive than ChatGPT’s, encompassing more scenarios such as
users not having an account or entering incorrect information. It is notable that
Bard’s specification also includes an exception handling section, which resembles
the nature of alternative flows. Bard’s specification, however, does not include any
postconditions. In summary, both ChatGPT and Bard incorporated additional details
inferred from their domain knowledge, which contributes to the maturation and
enrichment of the requirements, and ChatGPT’s specification appears to be more
detailed in most aspects, except for alternative flows where Bard’s specification
offers more comprehensive coverage.
3.2 Domain Modeling
In this phase, we compare the capability of ChatGPT and Bard on constructing UML
domain class diagrams that contain domain concepts, attributes, and relationships
without any design decisions such as operations, navigabilities, data types, or code-
level details such as visibility.
Identifying Domain Concepts In this task, we evaluate the ability of ChatGPT and
Bard to identify domain concepts from the TORS description. The following prompt
was given to the models:
– Prompt: “Identify domain concepts from the TORS description.”
The identified domain concepts along with descriptions are shown in Table 5.
Both models successfully identified common domain concepts such as Tour, Tour
Information, Reservation, Customer, Feedback, Employee, and Login credentials
with similar descriptions. ChatGPT identified an additional concept Database.
However, both models failed to identify System, Reservation facilities, Travel
agency, and Built-in network, which are meaningful concepts to the TORS system.
These observations suggest that both models have comparable proficiency in identifying
domain concepts, although there is room for improvement in capturing more
relevant concepts.
Identifying Attributes In this task, ChatGPT and Bard were evaluated on identifying
attributes for the identified domain classes. The following prompt was
given to the models:
– Prompt: “Identify attributes for the domain classes identified in the previous
prompt.”
ChatGPT identified a total of 49 attributes for the eight domain classes. Among
these attributes, approximately 25 (51%) appear to be drawn from ChatGPT’s
domain knowledge. Similarly, Bard identified a total of 37 attributes for the seven
domain classes, with about 11 (30%) attributes inferred from Bard’s domain
knowledge. Both models produced many attributes
in common, for instance, (i) Tour ID, Tour name, Destination, Duration (Date
and Time), Cost (Price), and Available dates (Availability) attributes in the Tour
class; (ii) the Tour ID, Description, and Images attributes in the Tour Information
class; and (iii) the Reservation ID, Tour ID, Customer ID, Reservation date (Date
and Time), Number of People, and Payment status (Payment method and Status)
attributes in the Reservation class. This might imply that both models have acquired
similar domain knowledge, likely from comparable sources. The models also
produced additional attributes specific to them such as (i) the Itinerary, Inclusions,
Exclusions, Departure location, and Available dates attributes in ChatGPT’s outputs
and (ii) the Videos attribute in Bard’s outputs. From these observations, both models
are capable of inferring meaningful attributes, many of which are common, but
ChatGPT's outputs are somewhat more extensive than Bard's. It should be noted that any
inferred attributes should go through a review and be confirmed by the client before
adoption.
Identifying Relationships Based on the identified domain classes, ChatGPT and
Bard were instructed to identify their relationships. The following prompt was given
to the models:
– Prompt: “Identify relationships for the identified domain classes.”
Generating Domain Class Diagrams Based on the classes, attributes, and
relationships identified above, the models were instructed to generate a domain class diagram.
The following prompt was given to the models:
– Prompt: “Based on the classes, attributes, and relationships identified earlier,
generate a domain class diagram.”
However, being text-based models, they were unable to produce a visual class
diagram. Instead, they provided a textual description of the domain class diagram,
as shown in Fig. 3. For better understanding, we manually created class diagrams
corresponding to the descriptions as shown in Fig. 4. As domain class diagrams are
part of the problem domain, they should not involve any design decisions such as
operations, navigability, and data types, which belong to the solution domain.
The diagrams generated by both models do not contain any design decisions,
which serves the intended purpose. However, there are some inconsistencies in the
generated class diagrams with the previously identified elements. In ChatGPT’s
diagram, the relationships (Database, Tour), (Database, Reservation),
(Database, Employee), and (Database, Login credentials) are missing, which
were previously identified in Table 6. Similarly, Bard’s diagram also lacks the
relationship of (Feedback, Customer), which was previously identified in Table 6.
Bard’s diagram also contains inconsistent attributes. The diagram shows the Tour
Information attribute in the Tour class, but it was not included in Table 5. Bard’s
model involves association names, which are appropriate and expected for domain
models, whereas ChatGPT’s model does not. However, Bard’s model has the
Feedback class without any relationships, which is undesirable. In terms of maturity,
ChatGPT’s model appears to be more comprehensive, as it depicts more complete
relationships among classes. The two models show similar structures that might
imply that their domain knowledge was built on similar sources.
3.3 Design Modeling
In this phase, we evaluate the capabilities of ChatGPT and Bard in producing design
models in terms of class diagrams and sequence diagrams. Design class diagrams
differ from domain class diagrams in that they involve design decisions such as
operations, navigabilities, and data types, which are part of human creativity. Hence,
a key aspect to observe in this evaluation is how the models handle making design
decisions.
Generating Design Class Diagrams The models were evaluated on building design
class diagrams based on the domain class diagrams in Fig. 4. The following prompt
was given to the models:
– Prompt: “Generate a design class diagram based on the domain class diagram
produced earlier.”
Fig. 4 Manually created domain class diagrams corresponding to the class diagram descriptions in Fig. 3
The generated textual descriptions of the design class diagrams are shown in
Fig. 5. For better understanding, we manually built graphical diagrams per the
generated descriptions as shown in Fig. 6. In ChatGPT’s diagram, the attributes
that denote relationship roles are represented on association ends (e.g., tour in
Reservation). Getters and setters are excluded by convention. Key points to observe
in the diagrams are design decisions made and consistency with the domain class
diagrams in Fig. 4.
The diagram generated by ChatGPT involves classes and relationships with
operations, data types, multiplicities, and navigabilities. All the classes in the
diagram were identified during domain modeling (cf. Table 5), but it does not
include the Tour information and Login credentials classes that were identified
during domain modeling, which is an inconsistency. In fact, the missing classes
are important data classes that should not be missed. The Database class shows no
relationships with other classes, which is undesirable and also inconsistent with the
domain model. All the classes in the diagram are defined with operations, which
are important design decisions, but most of them are getters and setters for attribute
access. The Reservation, Customer, and Employee classes include feature operations
such as viewTourinformation() and makeReservation(), which appear to be derived
Fig. 6 Manually created design class diagram corresponding to the text-based design class diagram by ChatGPT and Bard
from the TORS description, demonstrating the ability of making design decisions.
On the other hand, the operations in the Database class appear to be inferred from
ChatGPT’s domain knowledge, as they are typical data management operations like
add, update, and delete. With respect to attributes, all the attributes are appropriately
typed and follow a consistent naming convention of using lowercase for the first
letter and capitalizing the first letter of each subsequent word, which is impressive.
There are also some attributes that denote relationship roles such as the tour
and customer attributes in the Reservation class and the customer attribute in the
Feedback class, which should not be included in practice. Regarding relationships,
they were all identified during domain modeling (cf. Table 6). However, the
description does not include the relationships of (Tour, Tour information), (Tour
information, Database), (Customer, Database), and (Feedback, Database) identified
in domain modeling. The multiplicities on relationship ends are consistent with
those identified during domain modeling (cf. Table 5). However, the multiplicity on
the Employee end in the (Employee, Tour) relationship and the multiplicity on the
Customer end in the (Employee, Customer) relationship are missed. With respect
to navigabilities, they are all bidirectional but defined only for the relationships
of (Tour, Reservation), (Customer, Reservation), and (Customer, Feedback). The
navigabilities for the (Employee, Tour) and (Employee, Customer) relationships
are missed. Overall, the evaluation demonstrates ChatGPT’s capability of making
appropriate design decisions with some inconsistencies with the domain model.
The diagram generated by Bard includes classes and relationships with visibil-
ities, operations, and multiplicities but lacks data types and navigabilities, which
are important design decisions. All the classes that were identified during domain
modeling (cf. Table 5) are present, which demonstrates decent consistency. It also
introduces a new class System, which appears to serve as the facade controller,
providing system-level feature operations. However, the class has no relationships
with other classes, which is undesirable. Moreover, except for the System class, the
operations in all other classes contain only getters and setters with no feature
operations, which is not satisfactory from a design decision point of view. The
Public visibility is used for operations by default. With respect to relationships,
they were all identified during domain modeling (cf. Table 6), except for the
(Customer, Tour) relationship, which is newly added. However, the (Feedback,
Customer) relationship that was identified in domain modeling is missed. Overall,
the evaluation demonstrates the limited capability of Bard on making design
decisions on navigabilities and data types with some inconsistencies with the
domain model.
In comparison, ChatGPT’s diagram is superior to Bard’s diagram as it con-
tains more comprehensive design decisions such as data types and navigabilities.
Moreover, ChatGPT’s diagram better incorporates feature operations across classes,
whereas Bard has feature operations only in the System class. On the other
hand, ChatGPT’s diagram lacks visibilities, while Bard’s diagram has them. Both
diagrams lack return types for operations and contain inconsistencies with the
domain models.
Fig. 7 Manually created design sequence diagram corresponding to the text-based design sequence diagram by ChatGPT and Bard
makes the diagram more like a system sequence diagram in requirements analysis.
Furthermore, the operations in Bard’s diagram lack proper naming and show
significant inconsistency with the design class diagram in terms of operations.
3.4 Implementation
In this phase, we evaluate the ability of ChatGPT and Bard to implement design
class diagrams and sequence diagrams. We chose Java for the implementation
language.
Implementing Design Class Diagrams In this task, ChatGPT and Bard were
assessed for their capabilities in implementing the design class diagrams in Fig. 6.
The following prompt was given to the models:
– Prompt: “Implement the design class diagram produced earlier.”
Among the generated classes, we chose the Tour class for comparison, which
contains more comparable characteristics in terms of code details. The Customer class
was also chosen for ChatGPT as it contains code details highlighting the coding
abilities of ChatGPT. ChatGPT produced a basic implementation of the Tour
class, incorporating attributes and empty methods. The attributes are defined with
code-level details such as visibility and data type. However, some attributes are
inconsistent with the class diagram; for example, tourDestination is typed as int, while
the corresponding attribute in the class diagram is named tourDate and typed as
Date. All the methods were empty in the initial implementation, although
they were accompanied by comments outlining the intended implementation for
each method. The methods including the constructor were incrementally filled out
with specific implementations as further instructions were given. In the Customer
class, the cancelReservation method demonstrates the incorporation of various
control structures such as if-else statements, for loops, and nested structures. In the
implementation of the Reservation class (which is not presented in the figure), the
incorporation of data structures such as List and ArrayList was observed. However,
none of the implemented methods match the operations in the design class diagram,
which creates a significant inconsistency.
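To illustrate the kind of output described above, the following skeleton is a hypothetical sketch in the style reported for ChatGPT; it is not the actual generated code, and the attribute and method names are assumptions made for illustration.

public class Tour {
    // Attributes with visibility and data types, as in the reported output
    private String tourID;
    private String tourName;
    private double cost;

    // Constructor left empty initially; filled out after further instructions
    public Tour(String tourID, String tourName, double cost) {
        // TODO: assign the parameters to the attributes
    }

    // Check whether the tour is available on the given date
    public boolean checkAvailability(String date) {
        // TODO: implement the availability check
        return false;
    }
}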
The implementation generated by Bard encompasses attributes with specified
visibility and data types, along with the fully implemented getters, setters, con-
structor, and toString() method. The implementation of the constructor appears
appropriate, and the getters and setters follow standard conventions. It is notable
that the implementation includes the toString() method, which is a common practice
to provide a string representation of an object. Also, the implemented methods
align well with the operations defined in the design class diagram. However, the
implementation lacks the Tour information attribute as defined in the class diagram.
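By contrast, a class following the conventions described for Bard's output might look like the sketch below; again, this is a hypothetical illustration with assumed attribute names, not the actual generated code.

public class Tour {
    private String tourID;
    private String destination;

    // Constructor assigning the attributes
    public Tour(String tourID, String destination) {
        this.tourID = tourID;
        this.destination = destination;
    }

    // Getters and setters following standard conventions
    public String getTourID() {
        return tourID;
    }

    public void setTourID(String tourID) {
        this.tourID = tourID;
    }

    public String getDestination() {
        return destination;
    }

    public void setDestination(String destination) {
        this.destination = destination;
    }

    // String representation of the object
    @Override
    public String toString() {
        return "Tour{tourID=" + tourID + ", destination=" + destination + "}";
    }
}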
In conclusion, Bard’s implementation exhibits a more complete code with
standard conventions and higher consistency with the design class diagram. On
the other hand, ChatGPT’s implementation contains incomplete code and more
inconsistencies with the design class diagram, which further leads to traceability
issues between phases. However, ChatGPT was able to incrementally develop the
full implementation with rich code-level details as additional instructions were
provided.
Implementing Design Sequence Diagrams In this task, we evaluate the ability of
ChatGPT and Bard on implementing design sequence diagrams. For the evaluation,
they were instructed to implement the Make Tour Reservation sequence diagram in
Fig. 7. The following prompt was given to the models:
– Prompt: “Implement the “Make Tour Reservation” design sequence diagram
produced earlier.”
Table 7 shows the generated implementation. ChatGPT implemented the
sequence diagram as part of the driver class TORS, which is a new class not
included in the class diagram and, thus, does not participate in the sequence
diagram. The implementation attempts to follow the sequence in the diagram but
does not fully adhere to the intended sequence. Specifically, the diagram specifies
that the createReservation() method is called within the selectTour() method by
the Customer class, but the implementation shows that the createReservation()
method is called within the makeReservation() method by the TORS class, which
does not participate in the sequence diagram. This happened as the sequence
diagram was implemented solely in a single method, which demonstrates significant
inconsistency with the design. Furthermore, the implementation is incomplete in
certain parts leaving code sections marked with “...”.
Similar to ChatGPT, Bard implemented the sequence diagram as part of the
driver class TourReservationSystem, which is newly added during implementation
and, thus, does not participate in the sequence diagram. Although the sequence
diagram is implemented solely within the driver class, the overall sequence of the
implementation aligns well with the sequence diagram. This happens because the
design sequence diagram was built similar to system sequence diagrams where
sequences are captured between only the actor and the system. So, this can
be viewed as a wrong design but correct implementation to the design. There
also exists an inconsistency where the status parameter in the sequence diagram
is implemented as updating the tour’s availability. Overall, the implementation
maintains a decent level of consistency with the sequence diagram in terms of
message sequences.
In summary, Bard’s implementation exhibits a higher level of consistency with
the design compared to ChatGPT. However, Bard's sequence diagram was far
simpler, which makes it easier to implement. On the other hand, ChatGPT's
sequence diagram was more realistic and practical, but its implementation had
significant inconsistency with the design. Both ChatGPT and Bard incorporated data
generation in the driver class, which was impressive.
Table 7 (excerpt) Implementation of the Make Tour Reservation sequence diagram generated by ChatGPT and Bard

ChatGPT:
Reservation reservation = new Reservation();
String reservationId = reservation.createReservation(selectedTour);
tour.decrementAvailableSeats(selectedTour);
employee.confirmReservation(reservationId);
reservation.saveReservation(reservationId);
}

Bard:
// Check if the number of people is less than or equal to the capacity of the tour
if (numberOfPeople > tour.getCapacity()) {
}
...
    customer.getCustomerID(),
    numberOfPeople,
    paymentMethod
);
reservations.add(reservation);
tour.setAvailable(false);
System.out.println("Reservation created successfully!");
}

Fixing Errors The generated code by ChatGPT and Bard contained a number of
errors of different kinds, including missing import statements, missing modifiers
(e.g., static), undefined variables (e.g., symbol errors), undefined data types (e.g.,
undefined enumerations), undefined methods (e.g., missing getters and setters),
parameter mismatches (e.g., redundant arguments, type mismatches, ordering mis-
matches), and misuse of methods and constructors in library classes (e.g., Date,
Time). However, they were able to fix the majority of the errors effectively (more
than 90%), which was impressive. With Bard, the fixed code sometimes did not
contain any change, and further instructions had to be given until a proper fix was
incorporated. Such an issue did not occur with ChatGPT. Although both models suc-
cessfully addressed most errors individually, the fixes were made piecemeal for each
specific error, which led to many inconsistencies among the fixes; these inconsistencies
introduced new errors, which in turn required further fixes. This cycle continued
until the human developer intervened and resolved the errors manually. In summary,
both ChatGPT and Bard demonstrated competitive capabilities in fixing individual
errors, but human involvement was required to finalize the implementation and make
it executable.
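As an illustration of the kinds of errors and fixes described above, the following before/after snippet is hypothetical and not taken from the generated code in the study; it shows a misuse of a library class (java.util.Date) being replaced with the java.time API.

// Before (hypothetical): missing import and use of a deprecated java.util.Date constructor
//   Date reservationDate = new Date("2023-07-01");
// After: a typical fix using the java.time API
import java.time.LocalDate;

public class ReservationDateFix {
    public static void main(String[] args) {
        LocalDate reservationDate = LocalDate.parse("2023-07-01"); // ISO-8601 date string
        System.out.println("Reservation date: " + reservationDate);
    }
}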
4 Discussion
Based on the findings in Sect. 3, we answer the research questions posed in Sect. 1
as follows.
What Are the Respective Strengths and Weaknesses of ChatGPT and Bard in the
Context of Software Development? In requirements analysis, ChatGPT focused
more on technical aspects and functionality, while Bard was concerned more with
user experience and system interactions. ChatGPT performed slightly better in
clarifying ambiguities in requirements, incorporating clarifications into require-
ments, extracting FRs and NFRs, and generating use case specifications. In domain
modeling, Bard’s model was easier to understand due to the inclusion of association
names. However, ChatGPT’s model offers a more comprehensive structure with
more relationships among classes. In design modeling, ChatGPT produced more
mature and practical designs compared to Bard. ChatGPT’s design class diagram
incorporates more design decisions of different kinds than Bard’s. In terms of design
sequence diagrams, ChatGPT’s diagram demonstrates more collaborative behaviors
among objects, while Bard’s diagram primarily focuses on interactions between
two objects, resembling a system sequence diagram used in requirements analysis.
In implementation, Bard’s implementation was more consistent to the design with
standard coding practice. In contrast, ChatGPT’s implementation was less complete
in the initial implementation, but with further instructions, it incrementally evolved
to the mature level with rich code-level details. In error fixing, ChatGPT slightly
outperformed Bard in terms of accuracy. However, both models suffered from
significant inconsistency between the design and implementation, which further
raises traceability issues between phases.
How Can ChatGPT and Bard Be Effectively Used Together for Software Devel-
opment? It is found that ChatGPT and Bard focus on different aspects (technical
aspects in ChatGPT and user aspects in Bard) in requirements analysis, which
suggests that their combined use can help cover more comprehensive aspects and
achieve balanced analysis of requirements. ChatGPT is more mature in design mod-
eling in terms of making design decisions, while Bard can supplement additional
design elements (e.g., classes, attributes, operations, relationships). Both ChatGPT
and Bard demonstrate comparable capabilities in generating code and fixing errors,
with limited benefits in their combined use.
Threats to Validity The prompts used in this work are designed to be sequential
where the output of an earlier prompt is used as input for a later prompt, simulating
the forward engineering in the development process. Different approaches may
yield different outputs. The findings are based on only one case study, and thus,
the assessment is not general enough. The assessment was carried out by a
single researcher with 32 years of experience in software engineering. Different
assessment results may be drawn by other researchers.
5 Conclusion
The findings in this study suggest that ChatGPT and Bard had different focuses in
requirements analysis, ChatGPT performed slightly more effectively in design mod-
eling, and they are comparable in implementation and error fixing. However, both
models suffer from many inconsistencies among generated artifacts, and significant
inconsistency was observed between design and implementation, which further
leads to traceability concerns between phases. In both models, the involvement
of the human developer was required to address inconsistency issues and finalize
the development. Despite these challenges, it was observed that the models have
great potential to contribute to software development and their combined use can
be effective particularly in requirements analysis and design modeling. Further
studies with more case studies are needed to generalize the assessment and provide
statistical support. More focused studies targeting specific activities in a phase can
be valuable, such as applying design patterns in the design phase and refactoring
code in the implementation phase. Also, the testing phase, which is not covered in
this work, needs investigation. For more comprehensive comparison, other models
such as GitHub Copilot [2] can be included.
References
1. Dakhel, A.M., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M.C., Jiang, Z.M.:
GitHub copilot AI pair programmer: asset or liability? J. Syst. Software, 111734 (2023)
2. GitHub: Github copilot (2021). https://copilot.github.com/
DAnTE: A Taxonomy for the Automation
Degree of Software Engineering Tasks
Jorge Melegati and Eduardo Guerra
1 Introduction
Human history is defined by the use of tools. From the use of rudimentary stones
for preparing food in the Paleolithic to rockets taking us to space today, tools
have been used to augment our capabilities or even to give us newer and more
complex ones. As with any other human activity, software engineering (SE) has
also experienced, since its inception, the continuous creation of tools to increase
productivity, reduce errors, and facilitate the work of those involved in the activity.
High-level programming languages, integrated development environments (IDEs),
and frameworks, just to name a few, allowed the development of increasingly
complex software systems with reduced effort and time spent [28]. For at least a
decade, researchers have been exploring the use of artificial intelligence (AI) to
support SE tasks [31]. The emergence of a novel generation of AI tools, driven
by generative large language models (LLMs), promises to be a disruption in how
software is developed, intensifying this trend. Some research has investigated
how developers have employed different tools, such as IDEs [17], and even code
generation tools [37]. However, a framework to understand the evolution of these
tools, including generative AI, is still lacking.
In this chapter, based on a literature review of taxonomies in SE and other
fields, we propose DAnTE, a taxonomy of SE automation tools that considers the
whole evolution of these tools, positioning generative AI-based tools in a historical
continuum. DAnTE consists of six levels of degree of automation: No automation
(Level 0), Informer (Level 1), Suggester (Level 2), Local generator (Level 3), Global
generator (Level 4), and Full generator (Level 5). To evaluate the taxonomy, we use
it to classify two sets of tools focused on different SE activities: coding and testing.
This utility demonstration indicates that the taxonomy could be used for classifying
different tools for the automation of diverse SE tasks with clear advantages over
the existing classification schemes described in the literature. Practitioners could
employ it for evaluating and choosing tools to be used in different contexts. The
taxonomy could also be useful for classifying research endeavors and guiding the
development of novel tools.
2.1 Taxonomies
choose tools to be used in their projects, researchers to develop novel solutions, and
educators to prepare the future actors for these tasks.
In the SE literature, there are taxonomies focused on related topics. Feldt et al. [8]
proposed AI-SEAL, a taxonomy for the use of AI in SE, consisting of three facets:
point of application, type of AI applied, and level of automation. Point of application
regards “when” and “on what” the tool is being applied, i.e., process, product, or
runtime. Type of AI applied represents the type of AI technology used, for instance,
connectionist or symbolist. Finally, the level of automation, described in Table 1,
concerns the extent to which the tool assists the human in performing the task. The scale is
adapted from the ten levels of automation in human-computer interaction proposed
by Sheridan and Verplank [29]. The adoption of a taxonomy proposed for another
field has the positive aspect of having been evaluated; however, even if it is related
to the topic, it does not consider the specificities of the new field, in our case, SE.
Another issue is the focus on AI, neglecting the long history of tools developed for
SE automation.
Savary-Leblanc et al. [25] performed a systematic mapping study on the
employment of software assistants in SE. They classified assistants into three types:
informer systems, passive recommender systems, and active recommender systems.
Informer systems simply display the results of data analysis without any side effect.
Passive recommender systems analyze data and potentially produce one or several
alternatives for a decision-making problem. Finally, an active recommender system
In her seminal paper on an engineering discipline for software, Shaw [28] reviews
how SE evolved until that moment, including a discussion on the tools used. In
the beginning, around 1960, programs were small, and tools consisted mainly
of assemblers and core dumps. Around 1970, programs became more complex,
including more algorithms and data structures, but still usually consisted of
individual effort. In this period, tools like higher-level programming languages
and compilers emerged. Around 1980, systems became even more complex, with
some running continuously, and were the result of team efforts, leading to the
emergence of the IDEs. For example, Teitelman and Masinter [34], in a paper
from 1981, describe an integrated environment for the Lisp language. The tool
allowed programmers to switch between tasks, such as debugging and editing. In the
following years, IDEs became more sophisticated. In the 2000s, programmers could,
for example, in Eclipse, perform several tasks, such as navigating the code structure
and performing automatic refactoring [17]. More recently, the increased usage of
dynamic languages, such as JavaScript [35], led to the emergence of linters, simple
static analysis tools that “flag non-complex programming errors, best practices, and
stylistic coding standards” [9].
With the advancements in AI, researchers have been proposing tools to auto-
matically generate code, for example, IntelliCode [31]. These tools have started to
move to the industry, especially as add-ons to IDEs, such as GitHub Copilot,1 which
could be used in, for example, Visual Studio and IntelliJ. The recent emergence
of LLMs with tools like OpenAI’s ChatGPT could represent another leap in this
evolution. Researchers have already investigated its potential use for SE tasks, such
as architecture [1] and testing [42]. More in this regard can be found in other
chapters of this book.
3 Research Method
1 https://github.com/features/copilot.
studies are considered one of the best options to generate a taxonomy, given its
speed, connection with existing theories, and greater rigor [22]. A natural choice
would be to conduct a systematic review or mapping study in SE literature to
identify tools that automate processes. However, the SE literature breadth and the
interval of time that should be considered, given the goal of considering all attempts
for automating the different tasks, would lead to a massive number of papers to
be analyzed. Ralph [22] suggests that taxonomies can be created "by synthesizing,
combining, refining or adapting existing theories from SE or reference disciplines.”
Given the success of the taxonomy for autonomous driving, suggested by its
adoption by industry and governments [43], we decided to take it as a reference
to build a similar taxonomy to SE automation.
To perform this adaptation, we performed a literature review focusing especially
on secondary studies that classified, to some extent, automation tools for SE, as
described in Sect. 2. We also reviewed primary studies, especially recent ones that
focused on novel solutions, such as ChatGPT, which, given the short time interval
since the launch, would probably not be reported in published scientific secondary
studies. Then, using the taxonomy of driving automation, we compared the different
tools, grouping them according to levels of automation. Based on this process, we
propose a taxonomy for the automation degree of SE tasks described below.
To evaluate the taxonomy, we identified some tools regarding two SE activities,
namely, coding and testing, and classified them according to the taxonomy’s classes.
We aim to demonstrate that it fulfills the criteria suggested by Ralph [22], i.e., (1)
its class structure should reflect “similarities and dissimilarities between instances,”
(2) its classes facilitate the inference of properties of an instance based on how it has
been classified, and (3) it is effective regarding the proposed purpose. Considering
different approaches for taxonomy evaluation [36], this evaluation study can be
classified as a utility demonstration, in which the taxonomy’s utility is demonstrated
by actually classifying subject matter examples [30, 39]. The tools used in the
evaluation were taken from academic reports in the existing literature, as well as
state-of-practice tools used in industry.
4 The DAnTE Taxonomy
Each level is characterized by what developers and the tool are responsible for: at Level 5 (Full generator), developers specify solution requirements and the tool builds the solution architecture and automatically generates the necessary code; at Level 4 (Global generator), developers provide the solution specification and verify the generated code, while the tool builds a candidate solution architecture and automatically generates the necessary code to be checked by the developers; at Level 0 (No automation), developers are responsible for all the decisions and for implementing them.
4.1 Level 0: No Automation
This level considers the scenario with a total lack of automation or support tools. A
typical example is a developer or development team employing simple text editors to
create code. This scenario represents how software was developed in the early days
of computing, especially before the advent of IDEs [34]. Except for educational
settings and new approaches, we do not expect to find software being professionally
developed in such a way nowadays. However, we believe it is important to have this
level in the taxonomy as a baseline for the other levels. We also highlight that the
software development activity started at this level and new software development
platforms might start here.
Developer Involvement The developer is completely responsible for performing
the tasks and making all the decisions required to achieve the goals.
Tool Involvement No tool is employed to automate the process.
4.2 Level 1: Informer
This level represents the situations in which automatic tools are able to provide
developers useful information; however, they are not able to suggest improvements.
Some examples are tools included in simple IDEs that manage to statically check
the code syntax and warn the user of problems but are not able to propose fixes.
However, this level is not limited to static checks, and this information might
also be extracted at runtime. An example would be information about test code
coverage. More recent examples at this level are tools that automatically identify
software vulnerabilities and warn developers of such issues.
Developer Involvement The developer is responsible for performing the task and
making all the decisions needed considering, or not, the information provided by
the tool.
Tool Involvement The tool informs the developer or warns of issues, but is not
able to produce suggestions for fixing them.
4.3 Level 2: Suggester
At this level, tools are able not only to identify issues but also to propose solutions
for the developers. In other words, the tool is able to automatically suggest
modifications to the code originally created by the developer. Besides that, the
decision to incorporate or not the suggestions is made by the developer. There
are several examples in modern IDEs, such as auto-completion tools or refactoring
suggesters. Another example is DependaBot,2 a tool for GitHub that automatically
identifies project dependencies in a repository that could be updated and suggests
the most recent versions by automatically updating dependency files.
Developer Involvement The developer is responsible for performing the tasks, but
the tool is able to, automatically or not, suggest modifications.
Tool Involvement The tool suggests one or more possible alternatives from which
the developer can either choose one or, even, ignore all of them.
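As a minimal illustration of a Level 2 (Suggester) interaction, consider the hypothetical refactoring suggestion sketched below: an IDE detects a duplicated expression and proposes an extract-method refactoring, which the developer may accept, adapt, or ignore. The class and method names are assumptions made for the example.

// Before: duplicated discount computation that a refactoring assistant might flag
public class PriceCalculator {
    public double seniorPrice(double basePrice) {
        return basePrice - (basePrice * 0.10);
    }

    public double studentPrice(double basePrice) {
        return basePrice - (basePrice * 0.10);
    }
}

// After accepting the suggested extract-method refactoring
class PriceCalculatorRefactored {
    public double seniorPrice(double basePrice) {
        return applyDiscount(basePrice, 0.10);
    }

    public double studentPrice(double basePrice) {
        return applyDiscount(basePrice, 0.10);
    }

    private double applyDiscount(double basePrice, double rate) {
        return basePrice - (basePrice * rate);
    }
}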
4.4 Level 3: Local Generator
At this level, tools are able to automatically perform tasks in a constrained situation
with limited scope. This task can be based on the current state of the artifacts or
based on a brief description provided by the developer. At a module, component,
or solution level, the developer, or development team, is still responsible for the
conception. An example is the GitHub Copilot that, at the moment of writing,
is able to compose methods, functions, or even classes, based on comments or
function signatures; however, the overall structure of the code is defined by the
developer. Another example would be tools that automatically detect problems, like in
the previous level, but autonomously decide to change the code to correct or improve
it.
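For illustration, the sketch below shows the kind of completion a Level 3 tool might produce from a developer-written comment and signature. It is a hypothetical example, not actual GitHub Copilot output, and the class and method names are assumptions.

public class GradeUtils {
    // Developer-written comment and signature; a Level 3 tool proposes the body.
    // Returns the average of the given grades, or 0 if the array is null or empty.
    public static double averageGrade(double[] grades) {
        if (grades == null || grades.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double grade : grades) {
            sum += grade;
        }
        return sum / grades.length;
    }
}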
2 https://github.com/dependabot.
4.5 Level 4: Global Generator
Tools at this level might be able to produce complete solutions given a description
in a natural or domain-specific language. However, a developer still needs to verify
and, if needed, modify the solution. The tools could recognize that specific parts
of the system are beyond their capabilities and recommend that developers should
implement these parts. Currently, some informal reports are evaluating the use of
LLMs, such as ChatGPT, to achieve this level, with still modest results, without
any reported usage in real projects. Other tools at this level are limited to specific
scenarios, such as no-code tools.
Developer Involvement The developer only provides the solution specification
and verifies the generated solution.
Tool Involvement The tool implements a solution, including any necessary archi-
tecture and code, to be checked by developers.
4.6 Level 5: Full Generator
In this final level, tools would be able to develop a complete solution relying mainly
on interactions with developers using natural language instead of programming
language, representing a shift in the level of abstraction required for coding. This
change is similar to the emergence of high-level programming languages that
allowed most of the developers to not handle assembly languages, restricting it to
specific cases. In the future, programming languages might become just another
intermediate step that could be inspected but would not be the standard approach
to develop software. Figure 2 illustrates this scenario. Depending on its scope, it is
expected that a tool could generate all the artifacts needed for the system, which,
besides source code, can also include others, such as models and documentation.
Currently, some no-code tools might be able to perform similar tasks in very
restricted contexts.
Fig. 2 The developer provides natural language instructions to an automation tool, which generates the source code and the binary for the execution of the solution
Developer Involvement The developer only needs to specify the solution require-
ments to the tool in a natural language.
Tool Involvement The tool builds the solution architecture and automatically
generates the necessary code.
5 Evaluation
5.1 Level 0
Coding At this stage, no automation is foreseen. For coding, this scenario consists
of developers using simple text editors, without any specific features for handling
code, such as a syntax checker, to write code. In the early days of software
development, from around the 1960s until the beginning of the 1980s, programs were
small and generally developed as an individual effort [28]. The need for larger systems
and development teams led to the emergence of development environments and
integrated tools [28]. Since then, it is hard to imagine software being produced
without any supporting tool, except for small educational tasks.
Testing This scenario consists of developers elaborating the code of automated
tests manually, using only the requirements and information about the testing API.
Only a testing framework, such as JUnit [6], is used to provide some structure to the
tests. The setup for the tests and the verification are configured through code. The
developer has no aid on the test structure and its coverage.
5.2 Level 1
Coding At this level, tools inform developers about potential issues but are not able
to make any suggestions for fixing them. For coding, this level includes tools or
simple code editors that, for instance, are able to identify syntax errors or the use of
non-existent variables or functions. A contemporary example is the use of linters,
“lightweight static analysis tools that employ relatively simple analysis to flag non-
complex programming errors, best practices, and stylistic coding standards” [9].
These tools are especially useful for dynamically typed languages to warn about
errors that, in these languages, would be only observed in runtime. ESLint3 for
JavaScript and Pylint for Python4 are some examples.
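A comparable illustration in Java is sketched below; it is a hypothetical snippet, but many Java static analysis tools warn about comparing String values with ==. At Level 1 the tool only reports the issue and leaves the fix to the developer.

public class LoginCheck {
    public static boolean isAdmin(String role) {
        // Many static analysis tools flag this comparison of String values with ==
        return role == "admin";
    }
}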
Testing In this scenario, the development environment can provide some additional
information about the tests that can guide the developer to create the tests and the
test code. An example of a feature present in some tools that can help produce better
test code is automated test smell detection [19, 21]. This information can guide the
developer to create cleaner code and a more efficient test suite. Another useful
piece of information that can be provided is code coverage [26], which is usually shown
visually in the IDE by indicating which lines of code are covered by the tests and which are not.
Based on that, the developer might add new tests or modify the existing ones to
achieve the desired test coverage.
5.3 Level 2
Coding At this level, tools not only identify issues but make suggestions that could
be accepted or not by the developer. Regarding coding, tools that leverage similar
scenarios in a limited context to suggest code modifications to the developer are
included at this stage. These processes are implemented using
heuristics relying on similarity. A clear example is auto-completion tools in IDEs.
Once the developer starts typing, the tool looks for similar terms from the language
or elements in the project that could complete what the developer wanted to do.
Another example could be automatic refactoring, which has long been
supported by the Eclipse IDE for Java [40]. In this case, developers select pieces of
3 https://eslint.org/.
4 https://www.pylint.org/.
code and which type of refactoring they wish to perform; the tool, then, suggests
the refactoring based on pre-defined rules, and the developer can accept or not the
suggested modifications.
Testing Tools that can provide suggestions to create or improve the tests should be
included in this level. Similar to coding, there are suggested refactorings specific to
test code that can be applied automatically by the tools [14, 41]. Another kind of tool
that fits into this level is the creation of test templates. Most IDEs provide the feature
of creating test classes based on the structure of the class under test, suggesting
test methods to be created. The test template generation might also consider other
factors, such as test names [44].
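As an illustration, a test template of the kind generated at this level for a hypothetical Tour class with getCost() and checkAvailability() methods might look like the sketch below, assuming a JUnit-style testing framework; the tool suggests the structure, and the developer fills in setup and assertions.

import org.junit.jupiter.api.Test;

class TourTest {

    @Test
    void testGetCost() {
        // TODO: set up a Tour instance and assert on getCost()
    }

    @Test
    void testCheckAvailability() {
        // TODO: set up a Tour instance and assert on checkAvailability()
    }
}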
5.4 Level 3
Coding Tools that are able to generate solutions for different SE activities should
be included at this level. For coding, tools that can generate code for a given limited
scenario, generally restricted to a few lines of code and few degrees of freedom,
should be categorized in this level. An example is GitHub Copilot, which creates the
code of a function based on a comment or the signature provided by the developer.
The developer can also select one of many solutions suggested by the tool. GitHub
Copilot's suggestions contain low-complexity code that could be further simplified
or that relies on undefined helper methods [18]. It has also been observed that it
struggles to combine methods to form a solution [16].
Testing At this level, test code is automatically generated for limited and specific
contexts, such as unit tests of isolated classes and methods [15]. These tools usually
consider the internal structures of the test target, generating test suites trying to
maximize coverage [5, 10]. Some preliminary research also explored the usage
of LLMs, based on ChatGPT, for unit test generation [42] with promising results.
Even if some results indicate that the coverage of generated test suites is similar to
manually created tests [13], other studies show that several faults were not detected
by these automatically generated tests [27]. For instance, these tools have difficulties
generating tests with good coverage when the inputs need a specific format or type
to reach a given code branch.
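For illustration, the sketch below shows the kind of unit tests a Level 3 generator might produce for a small utility class like the hypothetical GradeUtils sketched earlier in this chapter; actual tools typically generate many such cases while trying to maximize coverage.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class GradeUtilsGeneratedTest {

    @Test
    void averageOfTwoGrades() {
        // Covers the loop branch
        assertEquals(7.5, GradeUtils.averageGrade(new double[] {7.0, 8.0}), 0.001);
    }

    @Test
    void emptyArrayReturnsZero() {
        // Covers the guard branch for an empty input
        assertEquals(0.0, GradeUtils.averageGrade(new double[0]), 0.001);
    }
}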
5.5 Level 4
Coding Starting from this level on, at the time of writing, no tools fulfill the
needs in a complete way. For coding, a tool at this level should be able to, given a
description, develop the overall solution, including the design and implementation.
In an object-oriented scenario, it would propose the classes, their interactions, and
also the implementation of the methods. However, the tool would not guarantee the
correctness, and developers should still check the proposed design and implemen-
tation. Several researchers and practitioners have experimented with ChatGPT to
investigate whether the tool is able to support the development of a software system
architecture, e.g., [1]; however, the results at the time of writing did not support the
capacity of the tool to fulfill this goal. Another example is AgentGPT,5 a tool that
allows the configuration and deployment of AI agents based on a natural language
chat with the developer.
Testing Tools at this level should be able to create tests not only for a class or a
method but considering the whole application. In that scope, we can find tools that
generate functional tests for specific kinds of applications, like Web apps [7] and
REST APIs [2]. For more specific cases, it is possible to find test generation for
non-functional aspects, like a proposal to generate tests to detect leaks in Android
applications [45]. Even if we have some test generation tools that could be classified
at this level for targeting the application as a whole, they are still specific to
platforms and focus the tests on a single aspect.
5.6 Level 5
Since no tool currently reaches this degree of automation for code or test generation,
it is hard to predict how humans will interact with the tools once this level is reached.
Based on the existing tools, which are still at the previous levels, it is expected that they
will follow an iterative process in which, based on the generated artifact, the user
provides feedback and more information in natural language to evolve it further in
the desired direction.
Coding At this final stage of the evolution of automatic generation, tools should
be able to completely and reliably develop a software solution based on a natural
language description of the required features. As in the driving automatic scenario
in which, at the last level, it is only needed to define the destination, code generation
tools in Level 5 simply require the goals to generate the final solution. The advance-
ments of LLMs might lead to this stage in medium or long term. Another set of
tools hinting how these tools could be are low-code/no-code platforms [23, 24], such
as Mendix6 and OutSystems.7 By leveraging model-driven engineering, automatic
code generation, cloud infrastructures, and graphical abstractions, these tools allow
the development of fully functional applications by end users with no particular
programming background [24]. However, these platforms require that users model
the domain, define the user interface, and specify business logic [24]. Level 5 tools,
in contrast, would rely only on a natural language description of the intended solution.
5 https://agentgpt.reworkd.ai/.
6 https://www.mendix.com/.
7 https://www.outsystems.com/.
6 Discussion
When classifying some practices at Levels 4 and 5, we observed that some tools are likely to be able to perform diverse tasks, such as coding and testing. Therefore, a possible further development of the taxonomy could consider the capacity of a tool to fulfill multiple tasks rather than, for example, the point of application used in the AI-SEAL taxonomy.
Ralph [22] identifies five possible errors when creating taxonomies: inclusion, exclusion, combination, differentiation, and excess. An inclusion error happens when an instance is included in a class even though it is not similar to the other instances of that class, while an exclusion error occurs when an instance is not grouped with other instances similar to it. In the utility demonstration we performed, we presented the differences among the tools that led us to group them together or apart. Combination errors happen when dissimilar instances are grouped together. In this regard, DAnTE is an improvement over the existing classification of software assistants, in the sense that it differentiates tools with different degrees of automation. Differentiation errors represent cases in which similar instances are separated into different classes. Based on the discussion above, we argue that the proposed taxonomy avoids this error, which is possible when using AI-SEAL and its ten levels of automation, by reducing the number of levels and providing an analogy. Finally, an excess error regards cases in which a class has no instances. Although Level 5 might currently fit this definition, the class is useful to guide the development of new tools that, given the rapid progress of generative AI and LLMs, we expect to become available in the near future.
7 Conclusions
This chapter presented DAnTE, a taxonomy for the automation degree of SE tasks, developed by leveraging a taxonomy for driving automation and reviewing similar works in SE. We performed an initial evaluation based on utility demonstration, i.e., by classifying coding and testing tools, with promising results. The taxonomy could support the comparison of diverse tools. It could also help practitioners understand what they can expect from different tools and facilitate the decision process of tool selection. The taxonomy also hints at the next steps in the automation of SE tasks, guiding practitioners and researchers alike in the development of novel tools.
Given the limited space, we were not able to present the use of the taxonomy for other SE tasks. However, our preliminary results indicate that it is useful for classifying diverse tools. This classification can support the decision-making processes of development teams regarding which tools they should adopt, allowing the comparison among different tools and also between a tool and the development context. For example, depending on the availability of developers or the specificity of the domain, tools at different levels might be more adequate. The taxonomy can also guide the development of new tools, providing the requirements that a new tool should fulfill to differentiate itself from competitors at a distinct level. Although not its focus, the taxonomy could also have implications for process taxonomies, such as the
Capability Maturity Model Integration (CMMI). Future work could investigate how tools at different levels influence development processes. A comprehensive analysis might indicate that the emergence of Level 4 or 5 tools could lead to the merging of different tasks, as our evaluation suggested, e.g., coding tools that also test the code they produce. The taxonomy could also be further developed, possibly by including other dimensions, such as the type of AI applied. For example, the capability of a tool to perform more than one task, as described above, might be a valuable dimension for classifying these tools. As generative AI technology advances and new tools for SE tasks are proposed, future work could investigate the addition of new dimensions to the taxonomy.
References
1. Ahmad, A., Waseem, M., Liang, P., Fahmideh, M., Aktar, M.S., Mikkonen, T.: Towards human-
bot collaborative software architecting with ChatGPT. In: Proceedings of the 27th International
Conference on Evaluation and Assessment in Software Engineering, pp. 279–285. ACM, New
York (2023). https://doi.org/10.1145/3593434.3593468
2. Arcuri, A.: RESTful API automated test case generation. In: 2017 IEEE International
Conference on Software Quality, Reliability and Security, pp. 9–20. IEEE, New York (2017)
3. Arcuri, A.: RESTful API automated test case generation with EvoMaster. ACM Trans. Softw.
Eng. Methodol. (TOSEM) 28(1), 1–37 (2019)
4. Bosch, J., Olsson, H.H., Crnkovic, I.: It takes three to tango: requirement, outcome/data, and
AI driven development. In: Software-intensive Business Workshop on Start-ups, Platforms and
Ecosystems (SiBW 2018), pp. 177–192. CEUR-WS.org, Espoo (2018)
5. Braione, P., Denaro, G., Mattavelli, A., Pezzè, M.: Sushi: a test generator for programs with
complex structured inputs. In: Proceedings of the 40th International Conference on Software
Engineering: Companion Proceedings, pp. 21–24 (2018)
6. Cheon, Y., Leavens, G.T.: A simple and practical approach to unit testing: the JML and JUnit
way. In: ECOOP 2002—Object-Oriented Programming: 16th European Conference Málaga,
Spain, June 10–14 Proceedings, pp. 231–255. Springer, Berlin (2002)
7. Dallmeier, V., Pohl, B., Burger, M., Mirold, M., Zeller, A.: WebMate: web application test
generation in the real world. In: 2014 IEEE Seventh International Conference on Software
Testing, Verification and Validation Workshops, pp. 413–418. IEEE, New York (2014)
8. Feldt, R., de Oliveira Neto, F.G., Torkar, R.: Ways of applying artificial intelligence in
software engineering. In: Proceedings of the 6th International Workshop on Realizing Artificial
Intelligence Synergies in Software Engineering, pp. 35–41. ACM, New York (2018). https://
doi.org/10.1145/3194104.3194109
9. Ferreira Campos, U., Smethurst, G., Moraes, J.P., Bonifacio, R., Pinto, G.: Mining rule
violations in JavaScript code snippets. In: 2019 IEEE/ACM 16th International Conference on
Mining Software Repositories (MSR), pp. 195–199. IEEE, New York (2019). https://doi.org/
10.1109/MSR.2019.00039
10. Fraser, G., Arcuri, A.: EvoSuite: automatic test suite generation for object-oriented software.
In: Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference
on Foundations of Software Engineering, pp. 416–419 (2011)
11. Gori, M., Betti, A., Melacci, S.: Learning principles. In: Machine Learning, pp. 53–111.
Elsevier, Amsterdam (2024). https://doi.org/10.1016/B978-0-32-389859-1.00009-X
12. Taxonomy and definitions for terms related to driving automation systems for on-road motor
vehicles. Standard, International Organization for Standardization, Geneva (2021)
13. Kracht, J.S., Petrovic, J.Z., Walcott-Justice, K.R.: Empirically evaluating the quality of auto-
matically generated and manually written test suites. In: 2014 14th International Conference
on Quality Software, pp. 256–265. IEEE, New York (2014)
14. Marinke, R., Guerra, E.M., Fagundes Silveira, F., Azevedo, R.M., Nascimento, W., de Almeida,
R.S., Rodrigues Demboscki, B., da Silva, T.S.: Towards an extensible architecture for
refactoring test code. In: Computational Science and Its Applications–ICCSA 2019: 19th
International Conference, Saint Petersburg, Russia, July 1–4, 2019, Proceedings, Part IV 19,
pp. 456–471. Springer, Berlin (2019)
15. McMinn, P.: Search-based software test data generation: a survey. Softw. Test. Verif. Reliab.
14(2), 105–156 (2004)
16. Moradi Dakhel, A., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M.C., Jiang, Z.M.J.:
GitHub Copilot AI pair programmer: Asset or Liability? J. Syst. Softw. 203, 111734 (2023).
https://doi.org/10.1016/j.jss.2023.111734
17. Murphy, G., Kersten, M., Findlater, L.: How are Java software developers using the Eclipse
IDE? IEEE Softw. 23(4), 76–83 (2006). https://doi.org/10.1109/MS.2006.105
18. Nguyen, N., Nadi, S.: An empirical evaluation of GitHub copilot’s code suggestions. In:
Proceedings of the 19th International Conference on Mining Software Repositories, pp. 1–5.
ACM, New York (2022). https://doi.org/10.1145/3524842.3528470
19. Palomba, F., Zaidman, A., De Lucia, A.: Automatic test smell detection using information
retrieval techniques. In: 2018 IEEE International Conference on Software Maintenance and
Evolution (ICSME), pp. 311–322. IEEE, New York (2018)
20. Parsons, J., Wand, Y.: Using cognitive principles to guide classification in information systems
modeling. MIS Q. 32(4), 839 (2008). https://doi.org/10.2307/25148874
21. Peruma, A., Almalki, K., Newman, C.D., Mkaouer, M.W., Ouni, A., Palomba, F.: tsDetect:
an open source test smells detection tool, pp. 1650–1654. ESEC/FSE 2020, Association for
Computing Machinery, New York (2020). https://doi.org/10.1145/3368089.3417921
22. Ralph, P.: Toward methodological guidelines for process theories and taxonomies in software
engineering. IEEE Trans. Softw. Eng. 45(7), 712–735 (2019). https://doi.org/10.1109/TSE.
2018.2796554
23. Rokis, K., Kirikova, M.: Challenges of low-code/no-code software development: a literature
review. In: Lecture Notes in Business Information Processing, vol. 462 LNBIP, pp. 3–
17. Springer International Publishing, New York (2022). https://doi.org/10.1007/978-3-031-
16947-2_1
24. Sahay, A., Indamutsa, A., Di Ruscio, D., Pierantonio, A.: Supporting the understanding and
comparison of low-code development platforms. In: 2020 46th Euromicro Conference on
Software Engineering and Advanced Applications (SEAA), pp. 171–178. IEEE, New York
(2020). https://doi.org/10.1109/SEAA51224.2020.00036
25. Savary-Leblanc, M., Burgueño, L., Cabot, J., Le Pallec, X., Gérard, S.: Software assistants in
software engineering: a systematic mapping study. Softw. Pract. Experience 53(3), 856–892
(2023). https://doi.org/10.1002/spe.3170
26. Shahid, M., Ibrahim, S.: An evaluation of test coverage tools in software testing. In: 2011
International Conference on Telecommunication Technology and Applications Proceedings of
CSIT, vol. 5. sn (2011)
27. Shamshiri, S., Just, R., Rojas, J.M., Fraser, G., McMinn, P., Arcuri, A.: Do automatically
generated unit tests find real faults? an empirical study of effectiveness and challenges (t). In:
2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE),
pp. 201–211. IEEE, New York (2015)
28. Shaw, M.: Prospects for an engineering discipline of software. IEEE Softw. 7(6), 15–24 (1990).
https://doi.org/10.1109/52.60586. http://ieeexplore.ieee.org/document/60586/
29. Sheridan, T.B., Parasuraman, R.: Human-automation interaction. Rev. Hum. Factors Ergon.
1(1), 89–129 (2005). https://doi.org/10.1518/155723405783703082
30. Šmite, D., Wohlin, C., Galviņa, Z., Prikladnicki, R.: An empirically based terminology and
taxonomy for global software engineering. Empir. Softw. Eng. 19, 105–153 (2014)
31. Svyatkovskiy, A., Deng, S.K., Fu, S., Sundaresan, N.: IntelliCode compose: code generation
using transformer. In: Proceedings of the 28th ACM Joint Meeting on European Software
Engineering Conference and Symposium on the Foundations of Software Engineering, pp.
1433–1443. ACM, New York (2020). https://doi.org/10.1145/3368089.3417058
32. Tanimoto, S.L.: VIVA: a visual language for image processing. J. Vis. Lang. Comput. 1(2),
127–139 (1990). https://doi.org/10.1016/S1045-926X(05)80012-6
33. Tanimoto, S.L.: A perspective on the evolution of live programming. In: 2013 1st International
Workshop on Live Programming, LIVE 2013—Proceedings, pp. 31–34 (2013). https://doi.org/
10.1109/LIVE.2013.6617346
34. Teitelman, W., Masinter, L.: The Interlisp programming environment. Computer 14(4), 25–33
(1981). https://doi.org/10.1109/C-M.1981.220410
35. Tomasdottir, K.F., Aniche, M., van Deursen, A.: Why and how JavaScript developers use lin-
ters. In: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering
(ASE), pp. 578–589. IEEE, New York (2017). https://doi.org/10.1109/ASE.2017.8115668
36. Usman, M., Britto, R., Börstler, J., Mendes, E.: Taxonomies in software engineering: a
Systematic mapping study and a revised taxonomy development method. Inf. Softw. Technol.
85, 43–59 (2017). https://doi.org/10.1016/j.infsof.2017.01.006
37. Vaithilingam, P., Zhang, T., Glassman, E.L.: Expectation vs. experience: evaluating the
usability of code generation tools powered by large language models. In: CHI Conference on
Human Factors in Computing Systems Extended Abstracts, pp. 1–7. ACM, New York (2022).
https://doi.org/10.1145/3491101.3519665
38. Viglianisi, E., Dallago, M., Ceccato, M.: Resttestgen: Automated black-box testing of RESTful
APIs. In: 2020 IEEE 13th International Conference on Software Testing, Validation and
Verification (ICST), pp. 142–152. IEEE, New York (2020)
39. Wheaton, G.R., Fleishman, E.A.: Development of a Taxonomy of Human Performance: A
Review of Classificatory Systems Relating to Tasks and Performance. Clearinghouse, Mumbai
(1968)
40. Xing, Z., Stroulia, E.: Refactoring practice: how it is and how it should be supported—an
eclipse case study. In: IEEE International Conference on Software Maintenance (ICSM), pp.
458–467 (2006). https://doi.org/10.1109/ICSM.2006.52
41. Xuan, J., Cornu, B., Martinez, M., Baudry, B., Seinturier, L., Monperrus, M.: B-refactoring:
automatic test code refactoring to improve dynamic analysis. Inf. Softw. Technol. 76, 65–80
(2016)
42. Yuan, Z., Lou, Y., Liu, M., Ding, S., Wang, K., Chen, Y., Peng, X.: No more manual tests?
evaluating and improving ChatGPT for unit test generation. arXiv preprint arXiv:2305.04207
(2023)
43. Zanchin, B.C., Adamshuk, R., Santos, M.M., Collazos, K.S.: On the instrumentation and
classification of autonomous cars. In: 2017 IEEE International Conference on Systems, Man,
and Cybernetics (SMC), pp. 2631–2636. IEEE, New York (2017). https://doi.org/10.1109/
SMC.2017.8123022
44. Zhang, B., Hill, E., Clause, J.: Automatically generating test templates from test names (n). In:
2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE),
pp. 506–511. IEEE, New York (2015)
45. Zhang, H., Wu, H., Rountev, A.: Automated test generation for detection of leaks in android
applications. In: Proceedings of the 11th International Workshop on Automation of Software
Test, pp. 64–70 (2016)
ChatGPT Prompt Patterns for Improving
Code Quality, Refactoring, Requirements
Elicitation, and Software Design
Abstract This chapter presents design techniques for software engineering, in the
form of prompt patterns, to solve common problems that arise when using large
language models (LLMs) to automate common software engineering activities,
such as ensuring code is decoupled from third-party libraries and creating API
specifications from lists of requirements. This chapter provides two contributions
to research on using LLMs for software engineering. First, it provides a catalog
of patterns for software engineering that classifies patterns according to the types
of problems they solve. Second, it explores several prompt patterns that have
been applied to improve requirements elicitation, rapid prototyping, code quality,
deployment, and testing.
1 Introduction
Large language models (LLMs), such as ChatGPT, now offer capabilities that are being integrated into integrated development environments (IDEs) [1, 2, 18], such as IntelliJ [13] and Visual Studio Code. Leveraging these capabilities previously required much more time and effort. In addition, prior state-of-the-art LLMs were not widely accessible to users.
Interacting with an LLM in general involves feeding it “prompts” [14], which
are natural language instructions used to provide context to the LLM and guide its
generation of textual responses. In a chat-assisted LLM environment, a prompt is
a message that a user sends to an LLM, such as ChatGPT, Claude, or Bard. The
remainder of this chapter focuses on the ChatGPT chat-assisted LLM.
In the context of software engineering, a prompt is a natural language instruction
given to an LLM to facilitate its generation of requirements, code, and software-
related artifacts (such as documentation and build scripts), as well as to simulate
certain aspects of a software system. Prompts are thus a form of programming used
to instruct an LLM to perform software engineering tasks. For example, in an IDE
environment (such as Copilot [1]), a prompt can be a comment, method definition,
or source file.
One way to use prompts in the software development life cycle is to ask an LLM
directly to provide some information or generate some code. Another use of prompts
is to dictate rules for the LLM to follow going forward, such as conforming to
certain coding standards. Both types of prompts program the LLM to accomplish a
task. The second type of prompt, however, customizes future interactions with the
LLM by defining additional rules to follow or information to use when performing
subsequent tasks. We cover both types of patterns in this chapter.
Overview of Prompt Patterns for Software Engineering Tasks This chapter
builds on our prior work that introduced the concept of prompt patterns [24],
which are reusable prompt designs to solve problems in LLM interaction. Similar
to software patterns [10, 20], prompt patterns codify sound design experience,
thereby providing a reusable solution to common problems in LLM interaction,
such as ensuring that software adheres to certain design principles, programming
paradigms, and/or secure coding guidelines.
Software developers and engineers can use prompt patterns to establish rules
and constraints that improve software quality attributes (such as modularity or
reusability) when working with LLMs. For example, prompt patterns can ensure
that generated code (or user-provided code being refactored) separates business
logic from code with side effects (e.g., file system access, database access, network
communication, etc.). These types of constraints make business logic easier to test
and reason about since it is decoupled from harder-to-test and harder-to-understand
side-effecting code. Prompt patterns can also require that third-party libraries have
intermediate abstractions inserted between the libraries and the business logic
depending on them to ensure the code is not tightly coupled to external dependencies
that would otherwise limit its portability and reusability.
Toward a Prompt Pattern Catalog for Software Engineering This chapter
extends our prior work [24] by focusing on creating a catalog of prompt patterns
that can be applied collaboratively throughout the software life cycle. We introduce
a variety of prompt patterns in this chapter, ranging from patterns that simulate and
reason about systems early in the design phase to patterns that help alleviate issues
with LLM token limits when generating code. In addition, we explore relationships between patterns by examining pattern compounds and sequences that are most effective when employed in combination with each other.
The remainder of this chapter is organized as follows: Section 2 gives an
overview of prompt pattern structure and functionality; Sect. 3 introduces the
catalog of prompt patterns covered in the chapter; Sect. 4 describes prompt patterns
used during requirements elicitation and system design; Sect. 5 describes prompt
patterns that help LLMs generate higher quality code and refactor human-produced
code; Sect. 6 compares our research on prompt patterns with related work; and
Sect. 7 presents concluding remarks and lessons learned.
2 Prompt Pattern Structure and Functionality
Prompt patterns are documented using a structure similar to software patterns, with analogous versions of the name, classification, intent, motivation, structure, example implementation, and consequences. Each of these sections of the prompt pattern form is described briefly below:1
– A name and classification. The name provides a unique identifier for the pattern
that can be referenced in discussions and the classification groups the pattern
with other patterns based on the types of problems they solve. The classification
used in this chapter is shown in Table 1.
– The intent and context capture the problem that the pattern solves and the goals of the pattern.
– The motivation explains the rationale and importance of the problem that the
pattern is solving.
– The structure and key ideas. The structure describes the fundamental contextual information that needs to be provided to the LLM to achieve the expected behavior. These ideas are listed as a series of statements but can be reworded and adapted by the user, as long as the final wording conveys the key information.
– Example implementation shows specific implementations of the pattern and
discusses them.
– Consequences discusses the pros and cons of using the pattern and how to adapt the pattern to different situations.
Prompt patterns can take various forms. In the context of patterns that enable
LLMs to perform software engineering tasks, a prompt typically starts with a
conversation scoping statement, such as “from now on,” “act as a X,” “for the next
1 Our prior work [24] defines the fundamental structure of a prompt pattern and compares it with
software patterns. We briefly define prompt patterns for completeness below, but we refer the reader
to our prior work for additional details.
four prompts,” etc. These statements direct the LLM to change its operation going
forward based on the prompt being provided. For example, the following prompt
pattern is an adaptation of the Output Automater pattern [24] that uses “from now
on” to automate production of a list of dependencies for generated code:
from now on, automatically generate a python requirements.txt file that includes any
modules that the code you generate includes.
After the initial conversational scoping statement, the prompt includes a number
of statements that provide the ground rules the LLM should follow in output
generation and prompt flow for software engineering tasks. These output rules may
include one or more conditional statements indicating when specific rules should be
applied. For example, the following prompt:
From now on, whenever you generate code that spans more than one file, generate a python
script that can be run to automatically create the specified files or make changes to existing
files to insert the generated code.
Normally, a user must manually open and edit multiple files to add generated
code that spans multiple files to a project. With the above prompt, ChatGPT will
generate a script to automate opening and editing each file for the user and eliminate
potential manual errors. The prompt is scoped to “from now on” and then includes
a conditional “whenever you generate code that spans more than one file,” followed
by the rule to generate a Python script. This prompt form is an example of the Output Automater pattern from [24], applied to software engineering.
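To make this concrete, the file-creation script produced in response to such a prompt might resemble the following minimal Python sketch (our own hypothetical illustration, not actual ChatGPT output; the file names and contents are placeholders):

import os

# Hypothetical mapping of generated file paths to their generated contents.
GENERATED_FILES = {
    "app/__init__.py": "",
    "app/models.py": "class User:\n    def __init__(self, name):\n        self.name = name\n",
    "requirements.txt": "flask\n",
}

def write_generated_files(files):
    """Create each file (and any missing parent directories) with its content."""
    for path, content in files.items():
        directory = os.path.dirname(path)
        if directory:
            os.makedirs(directory, exist_ok=True)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
        print(f"wrote {path}")

if __name__ == "__main__":
    write_generated_files(GENERATED_FILES)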
3 A Catalog of Prompt Patterns for Automating Software Engineering Tasks
This section summarizes our catalog of 14 prompt patterns that have been applied
to solve common problems in the domain of conversational LLM interaction and
output generation for automating common software tasks. We partitioned these 14
prompt patterns into four categories to help pattern users navigate and apply these
patterns more effectively. Table 1 outlines the initial classifications for the catalog
of prompt patterns for automating software engineering tasks identified by our work
with ChatGPT (GPT-3.5-turbo and GPT-4).
Two areas of LLM usage in the domain of software engineering that have
received scant attention thus far include (1) requirements elicitation and (2)
system design and specification. These areas represent some of the most important
aspects of software engineering, however, and commonly yield changes late in the
development cycle that cause schedule overruns, unanticipated costs, and risk. The
Requirements Elicitation patterns listed in Table 1 aid in creating requirements
and exploring their completeness with respect to desired system capabilities and
accuracy. Other patterns in this category use an LLM as a trusted intermediary to
reason about the impact of changes.
The System Design and Simulation Patterns category listed in Table 1 explores patterns that address issues in creating concrete design specifications and domain-specific languages and in exploring alternative architectures. The section demonstrates ways to simulate aspects of a system to help identify deficiencies early in the life cycle, i.e., when they are less costly and disruptive to remediate.
Considerable concern [2, 5, 18] has arisen regarding the quality of code produced by LLMs, as well as of code written via collaborations between humans and LLMs. The Code Quality patterns category introduces several patterns that improve both LLM-generated and human-written code. LLMs can often reason effectively about abstraction,
as well as generate relatively modular code. The patterns listed in this category in
Table 1 help ensure certain abstraction and modularity attributes are present in code,
e.g., they facilitate replacement of third-party libraries by introducing an interface
between them and business logic.
Finally, the Refactoring patterns listed in Table 1 provide various means to
refactor code using LLMs effectively. LLMs like ChatGPT have a surprisingly
powerful understanding of abstract coding constructs, such as pseudo-code. Inno-
vative pattern-oriented approaches to refactoring are therefore discussed to allow
specification of refactoring at a high level, such as using pseudo-code to describe
code structure.
All examples in this chapter were tested with the ChatGPT LLM. Our process
for identifying and documenting these patterns combined exploring community-
posted prompts on the Internet and creating independent prompts to support our
own software engineering work with ChatGPT. Our broader goal is to codify a
catalog of software engineering prompt patterns that can be easily adapted or reused
for various LLMs, much like how classic software patterns can be implemented
independently in various programming languages.
4 Requirements Elicitation and System Design Patterns
This section describes prompt patterns used during requirements elicitation and system design.
4.1 Requirements Simulator
Motivation
Example Implementation
This pattern showcases two unique exploration methods. First, the task-based
exploration handles tasks and investigates their feasibility. Second, the screen-based
exploration focuses on helping the user imagine an interface-driven walk-through
of system functionalities akin to a textual adventure game. Each has its merits
depending on the use case; task-based explorations are better for back-end services,
whereas screen-based explorations are suited for front-end or user-interface heavy
applications.
Consequences
The description of the system gains fidelity throughout the simulation. This iterative process allows rapid generation of large requirement sets.
However, a limitation of using this pattern in isolation is that the descriptions of the system are purely textual, which may provide insufficient information to capture all requirements accurately. One way to overcome this limitation, help write better requirements, and later generate design mockups is to combine this pattern with the Visualization Generator pattern [24]. This combination of patterns enables simulation of the system, as well as possible visualizations of each screen generated by another AI tool, such as Dall-E, Midjourney, or Stable Diffusion. A modified prompt that combines the Requirements Simulator and Visualization Generator patterns would be:
. . . In addition to the textual screen description, provide a Dall-E prompt that I can use to
generate wireframes of what the screen might look like.
The resulting wireframe, generated with Midjourney rather than Dall-E, is shown in Fig. 1. This figure shows that the generated image is a rough interpretation of possible realizations of the prompt. However, such visualizations may be easier to reason
about than textual descriptions, which can spur productive discussions between
participants. In addition, the randomness in image generation may yield unexpected
layouts or user interface approaches that may inspire new ideas.
Fig. 1 Wireframe samples generated from combining the requirements simulator and visualiza-
tion generator patterns
4.2 Specification Disambiguation
Motivation
Requirements are often captured and handed off between stakeholders in ways that result in a loss of meaningful context in the capture and delivery of those requirements. The Specification Disambiguation pattern helps automate some of the work needed to clarify requirement specifications, ideally before they are delivered to developers.
Example Implementation
By providing the prompt above along with the captured requirements, the LLM
can provide results of interest to the product manager. For example, if some
requirement specifications were:
1. Create a user interface that allows file uploads via Web browser.
2. Keep the files secure.
3. Make sure the files are movie file types.
4. Files can be no larger than 100 MB.
The LLM would provide a path for the product manager to bolster the require-
ment specifications prior to providing them to the development team. For example,
the LLM might suggest:
The system requirement to “Keep the files secure” could be made more precise by specify-
ing which security measures should be implemented to ensure the files are kept secure. For
example: “Implement access control and encryption measures to ensure that uploaded files
can only be accessed by authorized users and are protected against unauthorized access or
tampering.”
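As a concrete illustration, the following minimal Python sketch (our own, assuming the OpenAI Python client and its v1 chat-completions interface; the model name and prompt wording are placeholders) shows how such a disambiguation request could be scripted against captured requirements:

from openai import OpenAI

# Hypothetical requirements captured by a product manager.
REQUIREMENTS = """1. Create a user interface that allows file uploads via Web browser.
2. Keep the files secure.
3. Make sure the files are movie file types.
4. Files can be no larger than 100 MB."""

PROMPT = (
    "Review these requirements and point out any ambiguous or underspecified "
    "statements. For each one, propose a more precise rewording:\n" + REQUIREMENTS
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
)
print(response.choices[0].message.content)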
Consequences
4.3 API Generator
Motivation
Example Implementation
The implementation uses a concrete format for the specification, OpenAPI, and
assumes that the requirements for the system were previously discussed. Typically,
the API Generator pattern is used after a discussion of the requirements or even a
simple textual description of a system, such as “a web application for a customer
relationship management system.” The more detailed the list of requirements, the
more accurate the generated API will be, although developers can perform thought
experiments and simulations with prompts as simple as “generate an OpenAPI
specification for a web application related to cooking.”
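For reference, the skeleton such a prompt might yield could resemble the following (a hand-written illustration, not actual ChatGPT output; the paths and fields are placeholders), expressed here as a Python dictionary and serialized to JSON, one of the formats OpenAPI accepts:

import json

# Hypothetical, pared-down OpenAPI 3.0 skeleton for a CRM web application.
openapi_spec = {
    "openapi": "3.0.0",
    "info": {"title": "CRM API", "version": "0.1.0"},
    "paths": {
        "/customers": {
            "get": {"summary": "List customers",
                    "responses": {"200": {"description": "A list of customers"}}},
            "post": {"summary": "Create a customer",
                     "responses": {"201": {"description": "Customer created"}}},
        }
    },
}

print(json.dumps(openapi_spec, indent=2))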
Consequences
The API Generator pattern enables developers and/or teams to rapidly create
multiple potential API designs and compare/contrast them before selecting their
final design. In contrast, writing these APIs manually is tedious, so developers often
only have time to write and explore a limited number of API design options. In
general, providing developers with tools to experiment with different API designs
from a system description or requirements list is a powerful tool.
Another benefit of the API Generator pattern relates to the tendency of developers to write the API specification only after the code is implemented, because they do not want to spend time specifying the same information twice (i.e., once in the API specification and again in the actual code). By automating API production, developers are incentivized to create API specifications earlier in the design process.
Although existing (i.e., non-LLM) tools can generate an API specification from
code, they still require the initial production of code. Moreover, tools that can
generate skeletons of code from the API specification can be combined with this
pattern to accelerate the API implementation compared with writing it manually.
This pattern can be combined effectively with the API Simulator pattern
described in Sect. 4.4 to both generate and evaluate the proposed specification.
Simulating the API can allow developers to get a sense of the “ergonomics” of
the API and evaluate how hard it is to accomplish various tasks in code. The API
can also be refactored through the LLM using the Data-guided Refactoring pattern
described in Sect. 5.6.
Applying the API Generator pattern provides the benefits to the design and
development process discussed above. As with all tools and processes, however, it
also has potential drawbacks and considerations. For example, LLMs may produce
inconsistent API designs when given similar or updated requirements. These
inconsistencies make it hard to maintain consistency across a project or large code
base and could potentially lead to confusion during the comparison and selection of the final API design or in ensuring that different teams use a consistent API design strategy.
The LLM will likely require significant context to ensure consistency across API
generations.
4.4 API Simulator
The API Simulator pattern causes the LLM to simulate the API from a specification,
thereby enabling developers to interact immediately with an API and test it
through a simulation run by the LLM. LLMs possess powerful—and often largely
unrecognized—capabilities to generate synthetic data and tailor that data to natural
language descriptions of scenarios. In addition, LLMs can help developers explore a
simulated API by synthesizing sample requests, as well as providing usage guidance
and explaining errors.
Motivation
Although tools are available to simulate an API [12], they require setup to use and
may have limited ability to generate effective synthetic data. Current infrastructure
for simulating APIs also often just supports strict interaction, typically through
HTTP or code, rather than a more fluid interface based on a combination of pseudo-
operations and concrete operation specification. Early interaction with an API
design can aid developers in uncovering issues, omissions, and awkward designs.
Example Implementation
The scenario given to the LLM can describe the state of the simulation, data saved, etc. Similarly, the specification of the user input could simply be a description of what a user is doing with the API or a Web client.
Act as this web application based on the OpenAPI specification. I will type in HTTP
requests in plain text and you will respond with the appropriate HTTP response based on
the OpenAPI specification.
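One way to drive such a simulation programmatically is sketched below (our own illustration, assuming the OpenAI Python client's v1 chat interface; the model name is a placeholder). The OpenAPI specification and each typed request are appended to the conversation so the LLM can keep answering as the simulated API:

from openai import OpenAI

client = OpenAI()

# Placeholder; in practice, paste the OpenAPI document produced by the API Generator pattern.
OPENAPI_SPEC = "... OpenAPI specification goes here ..."

messages = [{
    "role": "system",
    "content": "Act as this web application based on the OpenAPI specification. "
               "I will type in HTTP requests in plain text and you will respond with "
               "the appropriate HTTP response based on the OpenAPI specification.\n"
               + OPENAPI_SPEC,
}]

while True:
    request_text = input("HTTP request (blank line to quit)> ")
    if not request_text:
        break
    messages.append({"role": "user", "content": request_text})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep the simulation state
    print(answer)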
Consequences
The API Simulator pattern enables users to customize their interactions or scenarios
of interaction with an LLM using natural language, which may be easier than trying
to accomplish the same thing in code. For example, users can tell the LLM, “for
the following interactions, assume that the user has a valid OAuth authorization
token for the user Jill” or “assume that 100 users exist in the system and respond
to requests with sample data for them.” More complex customization can also be
performed, such as “assume the users are from a set of 15–25 families and come
from multiple countries in North America.”
Interactions with a simulated API can be done through either a rigorous
programmatic form, such as “strictly interpret my input as an HTTP request and
reject any malformed requests” or “I am going to only provide pseudo data for the
input and you should fix any formatting issues for me.” The flexibility of interacting
with the LLM simulation and customizing it via natural language facilitates rapid
exploration of an API.
Another benefit of the API Simulator pattern arises when combining it with other
patterns so users can (1) have the LLM create examples of usage that are later used
as few-shot examples [23] in future prompts or (2) leverage the LLM to reason about
how hard or easy it is to accomplish various tasks in code. In particular, combining
this pattern with the Change Request Simulation pattern described in Sect. 4.8 allows
users to reason about the effort needed to accommodate changing assumptions later
in the software life cycle.
One important consideration is that the simulation will not be completely
accurate, so fine-grained analysis cannot be performed. It is important to ensure
that the analysis done with the simulation is at a high level and mainly used for
thinking through how interactions with the interface will work, sequencing, etc.
Users should not infer other properties of the interface, such as performance, based
on the simulation.
The API Simulator pattern, like any tool or process, has potential drawbacks and
considerations, despite its advantages. For example, users might start relying heavily
on simulated results, disregarding the fact that simulations may not accurately reflect
the behavior of a deployed API. This overreliance could lead to unrealistic expecta-
tions about system behavior or performance. Moreover, this pattern is driven by the
LLM, which is not designed to provide in-depth, detailed analysis of API behavior,
such as exact performance metrics and fine-grained error conditions. Therefore,
users may erroneously assume that the simulated performance is representative of
the actual API when implemented.
4.5 Few-shot Code Example Generation
The Few-shot Code Example Generation pattern instructs the LLM to generate a
set of usage examples that can later be provided back to the LLM as examples in a
prompt to leverage few-shot learning [23], which provides a limited set of example
training data in a prompt to an LLM. In the software domain, the few-shot examples
are proper usage of code that an LLM can learn from. In some cases, these examples
can convey the function and use of code in a more space/token-efficient manner than
the actual code itself.
This pattern leverages the LLM itself to generate few-shot examples that can
later be provided in a prompt, in lieu of writing the actual code. In response, an
LLM can reason about the original code more effectively. These examples can be
helpful to remind the LLM of the design/usage of the system that it designed in prior
conversations.
Motivation
LLMs typically have a limit on the number of tokens (e.g., words or symbols) that
can be input in a single prompt. Since a large software system or module may exceed
an LLM’s token limit, it may be necessary to describe design or programming
aspects (such as a module, class, set of functions, etc.) within the LLM’s token
limit to use it properly. This overrunning of the token limit necessitates a way to
remind the LLM of prior decisions it made in the past.
One approach to solve the token limit problem is to provide few-shot training
examples in a prompt that are based on the usage of the code, API, state transitions,
or other specification usage examples. These examples can demonstrate proper
usage and train the LLM on the fly to use the related design or code properly.
However, manually generating few-shot examples may not be feasible, which
motivates the Few-shot Code Example Generation pattern.
Example Implementation
These examples could then be used in later prompts to the LLM to remind it of
the design of the API with regard to the creation of users. Providing the examples
may be more concise and convey more meaning than natural language statements
that try to convey the same information.
In another example, the Few-shot Code Example Generation pattern implementation asks the LLM to create few-shot examples for the usage of a portion of code:
I am going to provide you code. Create a set of 10 examples that demonstrate usage of this
code. Make the examples as complete as possible in their coverage. The examples should
be based on the public interfaces of the code.
One valuable use of the LLM examples is to teach the LLM about the same code
in future interactions. It is common to need to teach an LLM about code, and one
way to teach it is with natural language documentation. However, documentation is
often not information dense and can use significant context in a prompt to the LLM.
In contrast, code examples can be information-rich and token-efficient relative to
natural language, particularly when the examples convey important meaning, such
as ordering of operations, required data, and other details, which are concise when
described in code but overly verbose in natural language. The Few-shot Code
Example Generation pattern aids in creating examples that can be provided to the
LLM to either remember or learn how to use a library, interface, or other code
artifact.
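The generated examples can then be stored and replayed in a later session. A minimal sketch of that reuse (our own illustration, assuming the OpenAI Python client's v1 chat interface; the file name and model are placeholders):

from openai import OpenAI

client = OpenAI()

# Hypothetical file holding the usage examples generated in an earlier session.
with open("api_usage_examples.txt", "r", encoding="utf-8") as handle:
    few_shot_examples = handle.read()

prompt = (
    "Here are examples showing how the user-creation API we designed is used:\n"
    + few_shot_examples
    + "\nUsing the same conventions, generate code that creates ten test users."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)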
Consequences
The Few-shot Code Example Generation pattern can be used early in the design
cycle to help capture expected usage of a system and then later provide a usage-
based explanation back to the LLM to highlight its past design decisions. When
combined with patterns, such as the API Simulator pattern, developers can rapidly
interact with the system and record the interactions and then supplement them with
additional generated examples.
This pattern is best applied when example usage of the system also conveys
important information about constraints, assumptions, or expectations that would
require more tokens to express in another format, such as a written natural language
description. In some cases, a document, such as an OpenAPI specification, may be
more token-efficient for conveying information. However, example usage has been
shown to be an effective way of helping an LLM perform problem-solving [27],
so this pattern may be a useful tool even when it is not the most token-efficient
mechanism for conveying the information.
The Few-shot Code Example Generation pattern yields many advantages, par-
ticularly helping LLMs understand and operate on code. However, developers
may face challenges in creating code examples that cover all possible use cases,
scenarios, or edge cases. As a result, the LLM may hallucinate when faced with
situations not covered by the examples, which can negatively affect downstream
behaviors since inaccurate code examples or examples containing bad practices
could mislead the LLM later. Likewise, the LLM could amplify these mistakes by
using those examples as the basis for future code generations, potentially creating
bigger problems.
4.6 Domain-Specific Language (DSL) Creation
The Domain-Specific Language (DSL) Creation pattern enables an LLM to create its
own domain-specific language (DSL) that both it and users can leverage to describe
and manipulate system concepts, such as requirements, deployment aspects, security
rules, or architecture in terms of modules. The LLM can then design and describe
the DSL to users. In addition, the examples and descriptions the LLM generates can
be stored and used in future prompts to reintroduce the DSL to the LLM. Moreover,
the examples the LLM generates will serve as few-shot examples for future prompts.
Motivation
A DSL can often express system concepts more concisely than general-purpose programming languages, markup languages, or other formats [19]. LLMs have a maximum number of “tokens,” which
corresponds to the maximum size of a prompt, and creating more token-efficient
inputs is important for large software projects where all the needed context may
be hard to fit into a prompt. Creating a DSL, however, can be time-consuming.
In particular, the syntax and semantics of the DSL (e.g., its metamodel) must be
described to an LLM a priori to enable subsequent interactions with users.
Example Implementation
Consequences
The Domain-Specific Language (DSL) Creation pattern may facilitate system design
without violating token limits. The specific syntax rules that are given to the LLM
must be considered carefully, however, since they directly influence the space
efficiency of the generated DSL. Although users of a DSL may only need to express
relevant concepts for a designated task, this high concept density may not translate
into the token efficiency of a textual representation of such concepts. For example,
an XML-based syntax for a DSL will likely be much more space consumptive than
a YAML-based syntax.
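To illustrate the point, the following self-contained snippet (our own example; the miniature module DSL is invented) compares the size of the same fact expressed in an XML-like syntax, a YAML-like syntax, and a compact positional syntax:

# Three textual encodings of the same hypothetical fact: module "billing"
# depends on modules "auth" and "db".
xml_form = '<module name="billing"><depends>auth</depends><depends>db</depends></module>'
yaml_form = "module: billing\ndepends: [auth, db]"
positional_form = "billing: auth db"  # position implies the module name, then its dependencies

for label, text in [("XML", xml_form), ("YAML", yaml_form), ("positional", positional_form)]:
    # Character count is a rough proxy for the number of LLM tokens consumed.
    print(f"{label:>10}: {len(text)} characters")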
Token efficiency in a DSL design can be improved via conventions and implicit
syntax rules. For example, positional conventions in a list can add meaning
rather than marking different semantic elements in the list via explicit labels. The
downside, however, is that the DSL may be harder to interpret for users unfamiliar
with its syntax, although this problem can be rectified by using the Few-shot Code
Example Generation pattern in Sect. 4.5 to create examples that teach users how
to apply the DSL. Combining these two patterns also helps the LLM self-document the usage of the DSL for later prompting.
The Domain-Specific Language (DSL) Creation pattern provides substantial
benefits, particularly when dealing with complex systems that might require mul-
tiple DSLs for different aspects. However, there are potential drawbacks that may
arise. For example, introducing multiple DSLs may increase the overall complexity
of a system, particularly for human developers that did not create the DSLs.
Understanding and maintaining multiple DSLs, and their references, can be hard
for humans over a project life cycle. Each DSL has its own structure and syntax
rules, which developers need to understand to supervise the LLM effectively.
Motivation
Example Implementation
Consequences
4.8 Change Request Simulation
The Change Request Simulation pattern helps users reason about the complexity
of a proposed system change, which could be related to requirements, architecture,
performance, etc. For example, this pattern helps users reason about what impact
a given change might have on some aspect of the system, such as which modules
might need changing. This pattern is particularly helpful when a group of stake-
holders need to discuss a possible requirements change, where the LLM serves as a
(potentially) unbiased estimator of the scope and impact of the change.
Motivation
In many situations, it may not be immediately clear to stakeholders what the impact
of a change would be. Without an understanding of a change’s impact, however,
it is hard to reason about the associated effects on schedule, cost, or other risks.
Getting rapid feedback on potential impacts can help stakeholders initiate the
appropriate conversations and experiments to better determine the true risk of the
change. Distrust between users may also complicate the discussion of the change
and necessitate an “unbiased” external opinion.
Example Implementation
The prompt above focuses the simulation on how the change will impact various
functions and files in the system. This approach allows stakeholders to estimate the
cost of a change by examining the complexity of the referenced files, functions, and
the total count of each. Alternatively, in cases where the entire affected section of
code can fit into the prompt, the LLM can be asked to identify lines of code that
may require changing.
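The chapter's original prompt is not reproduced here; as a rough, hypothetical stand-in, a change-request prompt of this kind could be assembled as follows (the module names and change description are placeholders):

# Hypothetical construction of a change-request simulation prompt.
change_description = "Switch user authentication from passwords to single sign-on (SSO)."
known_modules = ["auth", "user-profile", "billing", "web-ui"]

prompt = (
    "My software system is organized into these modules: "
    + ", ".join(known_modules) + ".\n"
    "Simulate the following change request and list which modules, files, and "
    "functions would likely need to change, and why:\n"
    + change_description
)
print(prompt)  # send this to the LLM together with any relevant design context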
Consequences
The hardest part of applying the Change Request Simulation pattern is establishing
enough context for the LLM to reason about a proposed change. This pattern works
best, therefore, when it is employed with other System Design category patterns,
such as the API Generator in Sect. 4.3, where conversation history can be used
to seed the analysis. The more concrete the change description is in relation to
the context, the more likely the LLM can provide a reasonable estimate of change
impact.
The Change Request Simulation pattern can also be used to reason either (1)
abstractly about a software-reliant system in terms of modules or (2) concretely
in terms of files, functions, and/or lines of code. Existing LLMs have token limits, so they can only consider a limited amount of information about a system at a time. Large sweeping
changes to a system can generally only be reasoned about at a higher level of
abstraction since the detailed information needed to implement such changes would
exceed an LLM’s token limit. Within a smaller set of files or features, however, an
LLM may be able to reason precisely about what needs to change.
One way to handle the tension between token limits and detailed output is to
apply the Change Request Simulation pattern iteratively to zoom in and out. Initially,
an abstract analysis is performed to identify features, modules, etc. that need to
change. The prompt is then modified to refine the context to a specific module or
feature and obtain greater detail from the LLM. This process can be repeated on
individual parts of the module or feature until sufficient detail is obtained. Likewise,
this process can be repeated for each high-level module to estimate the overall
impact of a proposed change.
Applying the Change Request Simulation pattern is reliant on providing the LLM
with enough context to reason about the proposed changes. The more detailed this
context is, the better the LLM can evaluate the possible impact of the change.
Providing enough context can be hard as users must present clear, detailed, and
specific scenarios. Moreover, the fidelity of the LLM’s output depends on the
accuracy of the context given.
5 Code Quality and Refactoring Patterns
This section describes prompt patterns that help LLMs generate higher-quality code and refactor human-produced code.
5.1 Code Clustering
The Code Clustering pattern separates and clusters code into packages, classes,
methods, etc. based on a particular property of the code, such as separating
pure code (i.e., code with no side effects) and impure code (i.e., code with side
effects) [22], business logic from database access, HTTP request handling from
business logic, etc. The Code Clustering pattern defines the expected cluster
properties to the LLM and then asks the LLM to restructure the code automatically
to realize the desired clustering. This pattern helps ensure that LLM-generated code exhibits the desired clustering and can also be used to refactor human-produced code to add clustering that was not originally present.
Motivation
How software is decomposed and clustered into packages, classes, methods, etc.
has a significant impact on how easily the code can be changed, extended, and maintained. By default, an LLM will not have guidelines on the decomposition/clustering needs for an application. This lack of context can lead an LLM to
produce code that appears monolithic, brittle, messy, and generally low quality. A
key motivation for applying the Code Clustering pattern, therefore, is to provide the
missing clustering context an LLM needs to generate higher quality code.
Example Implementation
Consequences
The Code Clustering pattern can dramatically improve the quality of LLM-produced
code. Unless the LLM is told otherwise, its code will solve the problem at hand
and often does not solve structuring problems, such as separating pure and impure
functions, that it has not been asked to solve. The Code Clustering pattern surfaces a
key issue in LLM software engineering, i.e., an LLM’s output is only as good as the
prompt it is given. Implicit knowledge (e.g., the project requires code that exhibits
certain clustering properties) will not be known to the LLM unless this information
is provided to it in a prompt.
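As an illustration of the kind of structure this pattern asks for, the following minimal sketch (our own example, not taken from the chapter) separates pure business logic from impure code with side effects:

# Pure business logic: no I/O, no side effects, easy to unit test.
def apply_discount(price: float, discount_rate: float) -> float:
    """Return the price after applying a fractional discount."""
    return round(price * (1.0 - discount_rate), 2)

# Impure code: performs file-system access, kept separate from the logic above.
def save_receipt(path: str, amount: float) -> None:
    """Append the final amount to a receipt file on disk."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"charged: {amount}\n")

if __name__ == "__main__":
    total = apply_discount(100.0, 0.15)  # pure computation
    save_receipt("receipt.txt", total)   # side effect isolated at the edge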
The Code Clustering pattern depends crucially on understanding implicit requirements, such as coding conventions, architectural constraints, and the intended separation of concerns, which are rarely stated explicitly in a prompt.
Motivation
By default, LLMs often generate code that is highly procedural and directly
translates the requirements spelled out in the prompt into code. In particular, the
implementation may not have sufficient abstraction or modularity, making it hard
to maintain. Moreover, as an LLM is continually prompted to add features to the
code, it may produce increasingly long methods with little separation of concepts
into packages, classes, or other abstractions that facilitate long-term maintainability.
Example Implementation
Consequences
The Code Clustering pattern can aid in producing high-quality code. By explicitly
instructing an LLM to structure its output in a certain way (e.g., separating pure and
impure functions), the resulting code can adhere to best practices and be easier to
understand and maintain. Another benefit of this pattern is that it can help ensure
consistency, particularly in large projects where multiple LLMs or even human
developers might be involved. Consistent code is more predictable and easier to
understand.
However, the Code Clustering pattern may not be able to design a good
abstraction simply from analysis of a single third-party library that provides a given
capability. For example, different dependencies may have different fundamental
architectures and interfaces. One way to address this heterogeneity is to leverage
the Few-shot Code Example Generation pattern in Sect. 4.5 to create examples of other
comparable third-party libraries and their usage and then ask the LLM to refactor
the interface so it can be implemented with any alternatives.
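The kind of intermediate abstraction discussed here can be sketched as follows (our own illustration; the class and function names are invented, and a console printer stands in for any third-party delivery library):

from abc import ABC, abstractmethod

class MessageSender(ABC):
    """Business logic depends on this interface, not on any concrete library."""
    @abstractmethod
    def send(self, recipient: str, body: str) -> None: ...

class ConsoleSender(MessageSender):
    """One interchangeable implementation; a third-party e-mail or HTTP client
    could be wrapped behind the same interface without touching business logic."""
    def send(self, recipient: str, body: str) -> None:
        print(f"to {recipient}: {body}")

def notify_overdue_invoices(sender: MessageSender, invoices: list) -> None:
    """Business rule: decide whom to notify; delegate delivery to the interface."""
    for invoice in invoices:
        if invoice["days_overdue"] > 30:
            sender.send(invoice["customer"], f"Invoice {invoice['id']} is overdue.")

if __name__ == "__main__":
    notify_overdue_invoices(
        ConsoleSender(),
        [{"id": 1, "customer": "ada@example.com", "days_overdue": 45}],
    )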
5.3 Principled Code
The Principled Code pattern uses well-known names for coding principles to
describe the desired code structure without having to explicitly describe each
individual design rule. For example, an organization may want to ensure that their
code follows SOLID [15] design principles.2 The goal is to ensure that generated,
refactored, and reviewed code adheres to the expected design principles.
Motivation
Example Implementation
2 The SOLID code refers to software that applies the following design principles: (1) Single respon-
sibility, (2) Open-closed, (3) Liskov substitution, (4) Interface segregation, and (5) Dependency
inversion.
This example uses the SOLID design principles as the desired design guidelines.
This named design methodology informs the LLM of the underlying principles that
code it generates should follow.
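The example prompt itself is not reproduced above; a hypothetical prompt in the spirit of this pattern might be as simple as the following:

# Hypothetical prompt applying the Principled Code pattern with SOLID as the
# named set of design principles.
prompt = (
    "From now on, when you generate, refactor, or review code, make sure it "
    "follows the SOLID design principles, and point out any violations you find."
)
print(prompt)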
Consequences
The Principled Code pattern works best with LLMs trained on a substantial volume
of written material that explains the application of the named principle to a range
of code bases. The more well-known the design principle(s), the more examples
the LLM will likely have been trained on. The availability of training examples
is particularly important for less mainstream languages or languages with more
uncommon designs, such as Prolog or Clojure.
This situation is similar to the Persona Pattern presented in our prior work [24],
where users describe the desired output using a well-known name. A consequence
of the Principled Code pattern is that it only works with well-known named
descriptions of code qualities that existed before the LLM was trained. Newer
coding or design styles that came after the training date will not be accessible
through this pattern. However, other approaches could be used to leverage in-context
learning and few-shot examples to incorporate these inaccessible named coding or
design styles.
5.4 Hidden Assumptions
The Hidden Assumptions pattern has an LLM identify and describe any assumptions
made in a section of code. The pattern helps users identify these assumptions
or remind them of assumptions they may have forgotten about. By showing key
assumptions from the code to users, the LLM can help ensure users account for
these assumptions in their decisions related to their code.
Motivation
Example Implementation
This first example focuses on listing assumptions that may be hard to change in
the future. This refinement of the pattern helps make developers aware of liabilities
in the code with respect to future change. If one of the assumptions is hard to
change—but developers expect this aspect will need to change—they can request
the LLM to refactor it to remove this flawed assumption.
A second example of the Hidden Assumptions pattern shows how it can be used
to aid in refactoring code from being tightly coupled to an underlying database:
List the assumptions in this code that make it hard to change from a MongoDB database to
MySQL.
With this example, the LLM will list the assumptions that tightly couple the code to a specific database. Users could then take this list and use it as the basis for refactoring, e.g., by asking the LLM to refactor the code to eliminate the listed assumptions.
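To give a feel for what the LLM might flag, the snippet below (our own illustrative code, not from the chapter) contains the kinds of implicit choices the Hidden Assumptions pattern is meant to surface, annotated in comments:

def load_orders(path="orders.csv"):
    # Hidden assumption: the data lives in a local CSV file with a fixed default name.
    with open(path, "r", encoding="utf-8") as handle:
        rows = [line.rstrip("\n").split(",") for line in handle]
    # Hidden assumption: the first row is a header and the columns appear in a fixed order.
    header, records = rows[0], rows[1:]
    # Hidden assumption: amounts are always present and always denominated in USD.
    return [{"id": r[0], "amount_usd": float(r[1])} for r in records]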
Consequences
However, the Hidden Assumptions pattern may not identify all hidden assump-
tions in the code. For example, there may be code outside of what is in the context
provided to the LLM that is needed to identify the assumption. The risk of this pattern is that developers will take its output as a source of truth for all assumptions in the code, rather than as a flagging of some possible assumptions for their consideration.
5.5 Pseudo-code Refactoring
The Pseudo-code Refactoring pattern gives users fine-grained control over the algorithm, flow, or other aspects of the code, while not requiring explicit specification of every detail. It also allows users to define pseudo-code for one or more details of generated or refactored code. The LLM is expected to adapt the output to fit the pseudo-code template while ensuring the code is correct and runnable.
Motivation
In many cases, users may have strong opinions or specific goals in the refactoring
or generation of code that are tedious to describe—and duplicative of an LLM’s
work—if it were necessary to type the exact code structures they wanted. In
particular, the benefit of using an LLM is reduced if developers must do as much
coding work as the LLM to specify what they want. The motivation of the Pseudo-
code Refactoring pattern is to provide a middle ground that allows greater control
over code aspects without requiring explicit coding or consideration of every detail.
Example Implementation
In this example, the prompt is asking the LLM to refactor a much larger body of
code to match the structure of the pseudo-code. The pseudo-code defines the outline
of the code, but not the details of how individual tasks are accomplished. In addition,
the pseudo-code does not provide exact traceability to which lines are part of the
described functionality. The LLM determines what the intent of the refactoring is
and how to map it into the current code provided to it.
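The original prompt and code base are not reproduced here, but a minimal sketch of such a prompt (the pseudo-code steps below are hypothetical, not taken from the original example) could look like the following:

Refactor the code below so that its overall structure follows this pseudo-code, while keeping the code correct and runnable:

   load configuration
   for each incoming record:
       validate the record
       transform the record
       persist the record

(followed by the body of code to be refactored)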
Consequences
The Pseudo-code Refactoring pattern enables more precise control over code
generation or refactoring. It empowers users to influence the algorithm’s flow,
structure, or other aspects without going into exhaustive details. The pattern also
boosts efficiency by enabling users to outline pseudo-code structures that the
LLM can populate. This pseudo-code reduces duplicate work and saves time since
developers need not program the full code structure that they desire.
The Pseudo-code Refactoring pattern can trigger more substantial refactoring
than what is outlined in the pseudo-code. For example, rewriting the code to match
the pseudo-code may require the LLM to remove a method and split its code
between two other methods. However, removing the method could then change the
public interface of the code.
The Data-guided Refactoring pattern allows users to refactor existing code to use
data with a new format. Rather than specifying the exact logic changes needed to
use the new format, users can provide the new format schema to the LLM and ask the
LLM to make the necessary changes. This pattern helps automate code refactoring
for the common task of incorporating changes to data formats.
Motivation
Refactoring code to use a new input or output data structure can be tedious.
When communicating with an LLM, explaining the explicit refactoring steps
may also require more time than actually conducting the needed steps. The Data-
guided Refactoring pattern provides a concise way of explaining to the LLM
what refactoring is needed. Since changing data formats can have such a large-
scale impact on a system, this pattern can automate these refactorings, potentially
reducing costs and accelerating overall system development.
Example Implementation
This example asks the LLM to refactor a method to use a different format for
the graph. In the example, the specific use of the graph format is not defined, but
could potentially be input, output, or internal to the method. All the different uses
of the graph would be supportable by the pattern. In addition, the implementation
uses ellipses to indicate portions of the data structure, which allows the LLM to fill
in user intent with concrete details.
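As an illustration only (the schema and field names below are hypothetical and not taken from the original example), a prompt following this pattern might read:

Refactor the following method to use the new graph format shown below instead of its current representation:

   {
     "nodes": [ { "id": ..., "label": ... }, ... ],
     "edges": [ { "source": ..., "target": ..., "weight": ... }, ... ]
   }

(followed by the method to be refactored)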
Consequences
The Data-guided Refactoring pattern reduces the manual effort to specify refactor-
ing for many types of code changes necessitated by a change in data format. In many
cases, the refactoring can be completely automated through this process, or at least
bootstrapped, thereby accelerating and potentially reducing the cost of changing data
formats.
While LLMs can operate with many different schemas and refactor code
accordingly, they may struggle with complex or niche data formats different from
what the LLM saw in its training data. Developers may therefore need to provide
additional guidance or instruction in these cases. Moreover, an LLM might not
understand the full implications of the data format change and may not refactor all
necessary parts of the code, leading to an inconsistent codebase that uses multiple
data formats at different points in the code.
6 Related Work
Software patterns [10, 20] have been studied extensively and have shown their value in
software engineering. Software design patterns have also been specialized for other
types of non-traditional uses, such as designing smart contracts [26, 28]. Prompt
design patterns for software engineering are complementary to these, although not
focused on the design of the system itself but on the interactions with an LLM to
produce and maintain software-reliant systems over time.
Prompt engineering is an active area of study, and the importance of prompts is
well understood [7]. Many problems cannot be solved by LLMs unless prompts are
structured correctly [9]. Some work has specifically looked at prompting approaches
to help LLMs learn to leverage outside tooling [27]. Our work complements these
approaches, focusing on specific patterns of interaction that can be used to tap into
LLM capabilities to solve specific problems in software engineering.
Much discussion on LLM usage in software engineering to date has centered
on the use of LLMs for code generation and the security and code quality risks
associated with that usage. For example, Asare et al. [2] compared LLM code
generation to humans from a security perspective. Other research has examined the
quality of generated answers and code from LLMs [5, 9, 11, 16] and interaction
patterns for fixing bugs [21, 25]. Our research draws inspiration from these
explorations and documents specific patterns that can be used to improve code
quality and help reduce errors. Moreover, as more prompt patterns are developed,
different patterns can be quantitatively compared to each other for effectiveness in
solving code quality issues.
7 Concluding Remarks
Ever since ChatGPT was released to the public in November 2022, much atten-
tion [1, 2, 18] has focused on the mistakes that LLMs make when performing
software engineering tasks, particularly with respect to generating code with defects
and/or security vulnerabilities. As shown in this chapter, however, prompt patterns
can be used to help alleviate many of these mistakes and reduce errors. Moreover,
prompt patterns can tap into LLM capabilities that are hard to automate using exist-
ing technologies, including simulating a system based on requirements, generating
an API specification, and pointing out assumptions in code.
The following are lessons learned thus far from our work on applying ChatGPT
to automate common software engineering tasks:
– Prompt engineering is crucial for unlocking the full capabilities of LLMs
for software engineering tasks. The prompt patterns described in this chapter
codify effective prompt design techniques that can help address common soft-
ware engineering challenges. Thoughtful prompt design is key to tapping into
ChatGPT’s strengths.
– Prompt patterns enable rapid exploration and experimentation throughout
the software life cycle. Patterns like the API Generator, API Simulator, and
Architectural Possibilities allow developers to experiment rapidly with multiple
designs and approaches early in the life cycle. This agility can accelerate
development and lead to better solutions.
– Integrating prompt patterns together into pattern catalogs can further
enhance their effectiveness. This chapter discusses chaining prompt patterns
together, such as combining the Requirements Simulator and Visualization
Generator patterns. These sequences and combinations of patterns can build
upon each other to accomplish more complex goals.
– The depth of capabilities of LLMs, such as ChatGPT, is not widely or
fully understood or appreciated. LLMs hold immense potential for helping
to automate common tasks throughout the software engineering life cycle. Many
LLM capabilities have the potential to accelerate software engineering, not just
by generating code but by making rapid experimentation at many different levels
of abstraction possible. A key to leveraging these capabilities is to codify an
effective catalog of prompts and guidance on how to combine them at different
stages of the software life cycle to improve software engineering.
– Significant human involvement and expertise is currently necessary to
leverage LLMs effectively for automating common software engineering
tasks. The tendency of ChatGPT to “hallucinate” confidently and enthusiastically
when generating incorrect output requires close scrutiny from human users at this
point. While prompt patterns can help mitigate some of these issues, much further
work is needed on other aspects of prompt engineering (such as quality assurance
and versioning) to ensure output of LLMs is accurate and helpful in practice.
References
21. Sobania, D., Briesch, M., Hanna, C., Petke, J.: An analysis of the automatic bug fixing
performance of ChatGPT. arXiv preprint arXiv:2301.08653 (2023)
22. Wadler, P.: The essence of functional programming. In: Proceedings of the 19th ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 1–14 (1992)
23. Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few examples: a survey on few-
shot learning. ACM Comput. Surv. (CSUR) 53(3), 1–34 (2020)
24. White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith,
J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with ChatGPT.
arXiv preprint arXiv:2302.11382 (2023)
25. Xia, C.S., Zhang, L.: Conversational automated program repair. arXiv preprint
arXiv:2301.13246 (2023)
26. Xu, X., Pautasso, C., Zhu, L., Lu, Q., Weber, I.: A pattern collection for blockchain-based
applications. In: Proceedings of the 23rd European Conference on Pattern Languages of
Programs, pp. 1–20 (2018)
27. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: React: synergizing
reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022)
28. Zhang, P., White, J., Schmidt, D.C., Lenz, G.: Applying software patterns to address interop-
erability in blockchain-based healthcare apps. CoRR abs/1706.03700 (2017). http://arxiv.org/
abs/1706.03700
Requirements Engineering Using
Generative AI: Prompts and Prompting
Patterns
1 Introduction
choice of LLM, which we accessed through the API to perform the RE tasks.
The two RE tasks we chose to focus on in this study are (a) binary requirements
classification and (b) identifying requirements that are dependent on each other
(requirements traceability task). We measure the performance of the patterns’
implementation in these tasks using measures like precision, recall, accuracy, and
F-score. We evaluate the performance of the model at different temperature settings
to understand its effect on prompt patterns’ performance. We recommend prompt
patterns that achieve the best performance scores for the selected RE tasks in our
experimental configuration. We also propose a framework to evaluate a prompt
pattern’s effectiveness for any RE task based on the methodology we employed
for this study. To that end, we aim to answer the following research questions:
2 Background
Our approach toward designing the binary requirements classification and require-
ments traceability tasks for our experimental setup involves influences from require-
ments Information Retrieval (IR). In the context of requirements IR, classification
involves categorizing requirements into different groups [20]. Requirements classi-
fication and IR share common principles related to information organization, search
and retrieval, and semantic understanding among other things. IR methods are also
used to search for specific traceability information during software development,
helping stakeholders locate related artefacts and trace the relationships between
them.
Zero-Shot Prompting In its most basic form, zero-shot prompting involves using
natural language sentences to convey the “problem to be solved” or the “expected
output,” without providing any examples [7]. It is a technique modelled after Zero-
Shot Learning (ZSL), which directly applies previously trained models to predicting
both seen and unseen classes without using any labelled training instances [9].
Few-Shot Prompting Few-shot prompting builds on zero-shot prompting by
conveying the “problem to be solved” or the “expected output” using a few
demonstrations of the task (examples) at inference time as conditioning [14].
Chain-of-Thought Prompting Chain-of-thought prompting is a technique that is
observed to enhance the reasoning capabilities of an LLM. Using this technique,
the user constructs the prompts in a way that makes the model generate a coherent
series of intermediate reasoning steps that lead to the final answer for the task at
hand [16].
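To make these techniques concrete for the kind of task studied in this chapter, the following are illustrative prompts (hypothetical examples, not the prompts listed in Table 1) for classifying a single requirement as functional (F) or non-functional (NF):

Zero-shot: "Classify the following requirement as functional (F) or non-functional (NF): 'The system shall encrypt all stored patient data.'"

Few-shot: "Requirement: 'The system shall allow users to reset their password.' Label: F. Requirement: 'The system shall respond to user queries within 2 seconds.' Label: NF. Requirement: 'The system shall encrypt all stored patient data.' Label:"

Chain-of-thought: "Classify the following requirement as functional (F) or non-functional (NF). First reason step by step about whether it describes a behaviour the system must perform or a quality the system must exhibit, and then give the label: 'The system shall encrypt all stored patient data.'"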
There are several works in the existing literature that focus on the application of
LLMs for various IR-specific software engineering (SE) tasks, while few focus on RE
tasks like binary and multiclass classification of requirements. Zhang et al. [20]
empirically evaluate ChatGPT’s performance on requirements IR tasks. Under the
zero-shot setting, their results reveal that ChatGPT’s performance in IR tasks has
high recall but low precision. They posit their evaluation provides preliminary
evidence for designing and developing more effective requirements IR methods
based on LLMs. Alhoshan et al. [1] report an extensive study using the contextual
word embedding-based zero-shot learning (ZSL) approach for requirements classi-
fication. The study tested this approach by conducting more than 360 experiments
using four language models with a total of 1020 requirements and found generic
language models trained on general-purpose data perform better than domain-
specific language models under the zero-shot learning approach. Their results
show that ZSL achieves F-Scores from 66% to 80% for binary and multiclass
classification tasks.
To the best of the authors’ knowledge, there are no works that focus on measuring
the performance of LLMs in requirements traceability tasks. A systematic mapping
study (SMS) performed by Xu et al. [19] presents 32 machine learning (ML)
technologies and 7 enhancement strategies for establishing trace links through
their work. Their results indicate that ML technologies show promise in predicting
emerging trace links by leveraging existing traceability information within the
requirements. They identified three studies that they classified under the “seman-
tically similar words extraction” enhancement strategy.
White et al. [17] present prompt design techniques for software engineering in the
form of patterns to automate common software engineering activities using LLMs.
Prompt patterns serve as a means of knowledge transfer, similar to software patterns,
by offering reusable solutions to common problems related to generating output
and engaging with LLMs. They establish a framework for documenting prompt
patterns that can be adapted to various domains, providing a systematic approach
for structuring prompts to tackle a range of issues. Their work also presents and
discusses a catalogue of patterns that have proven successful in enhancing the
quality of LLM-generated conversations.
Despite these studies, to our knowledge, there are no works that focus on
measuring the performance of an LLM in performing a certain RE task while using
a specific prompt pattern to craft the input prompts. This is crucial since there is
an opportunity to identify prompt patterns that work better in comparison to others
for particular RE tasks.
3 Methodology
We selected five prompt patterns for our experiments with the RE tasks, (1)
Cognitive Verifier, (2) Context Manager, (3) Persona, (4) Question Refinement, and
(5) Template, out of 16 patterns presented by White et al. [17]. They have been
selected on the basis of the descriptions for each prompt pattern provided by the
authors, which include the intent and motivation behind the pattern’s proposal, the
structure and key ideas that the pattern represents, an example implementation of the
pattern in practice, and the observed consequences of the pattern’s implementation
in practice. We used these five patterns to craft prompts for the selected RE tasks
and presented these prompts in Table 1. We performed each experiment five times
using each of the prompts presented in Table 1, gathering the model’s replies and
computing the aggregated performance measures (precision, recall, accuracy, and
F-score) for each run.
3.2 Datasets
We used the PROMISE dataset [5] for the classification task and the PURE dataset
for the requirements traceability task as they have been widely used in the literature.
The PROMISE dataset, which is available in .arff file format, was converted into
.csv format. The CSV file (ground truth for our experiment) had a total of 621
requirements, out of which 253 requirements were functional requirements labelled
as F and 368 were non-functional requirements labelled as NF.
3.3 Tasks
We begin the study by conducting a series of controlled experiments with the model
for two tasks, (i) Binary Classification of Functional Requirements (FR) and Non-
Functional Requirements (NFR) and (ii) Requirements Traceability, described in
more detail below.
Binary Classification of Functional and Non-Functional Requirements The
task aims to distinguish Functional Requirements (FR) from Non-Functional
Requirements (NFR), assuming that a requirement belongs to either the FR or
NFR class. The PROMISE NFR dataset [5] was used for this purpose. The process
we followed is:
1. We wrote a Python program that randomly picks 50 requirements from a
CSV file. This CSV file contains 621 unlabelled requirements, the same 621
requirements that make up the PROMISE dataset.
2. We also input a prompt that we created using one of the five patterns directly into the
program.
3. We use the model through the API to perform the classification task. We do this
five times, and each time, the program chooses a different set of 50 requirements
randomly.
4. The program then automatically compares the classification results with the
ground truth results from the PROMISE dataset.
5. We repeated this whole process five times, once for each of the prompt patterns
we were testing.
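A minimal sketch of such a program is shown below. It is illustrative only: the file name, model identifier, prompt text, and response parsing are placeholder assumptions rather than the actual script used in the study.

import csv
import random
from openai import OpenAI  # assumes the openai>=1.0 Python client library

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

PROMPT = (
    "Classify each of the following requirements as functional (F) or "
    "non-functional (NF). Answer with one label per line, in order.\n\n{reqs}"
)

def load_ground_truth(path):
    # Placeholder CSV layout: column 0 = requirement text, column 1 = F/NF label
    with open(path, newline="") as f:
        return [(row[0], row[1]) for row in csv.reader(f)]

def classify(batch, temperature):
    reqs = "\n".join(f"{i + 1}. {text}" for i, (text, _) in enumerate(batch))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model identifier
        messages=[{"role": "user", "content": PROMPT.format(reqs=reqs)}],
        temperature=temperature,
    )
    lines = [l.strip() for l in response.choices[0].message.content.splitlines() if l.strip()]
    return [l.split(".")[-1].strip() for l in lines]  # naive parsing of "1. F"-style answers

if __name__ == "__main__":
    data = load_ground_truth("promise_requirements.csv")  # hypothetical file name
    for run in range(5):
        batch = random.sample(data, 50)  # a different random set of 50 requirements per run
        predictions = classify(batch, temperature=0.0)
        correct = sum(p == label for p, (_, label) in zip(predictions, batch))
        print(f"Run {run + 1}: accuracy = {correct / len(batch):.2f}")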
Requirements Traceability For this task, we had two sets of software requirement
specifications (SRS) taken from the PURE dataset [6]: one for a home temperature
control system called “THEMAS” and the other for defining a game interface and
its functionalities, known as “QHEADACHE.” The process we followed is:
1. For consistency, we manually formatted these SRS files and removed any
unnecessary information, like hyperlinks.
2. We then created modified versions of these documents, referred to as “THEMAS
clean” and “QHEADACHE clean,” where we removed trace links.
3. We provided these cleaned documents, without trace links, to the model pro-
grammatically.
4. The model was given a requirement (randomly selected from the input docu-
ments) and asked to identify all related or dependent requirements.
5. We repeated this process five times for each prompt pattern, with a different
randomly selected set of requirements each time.
6. This entire procedure was repeated five times, once for each prompt pattern we
were testing.
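Purely as an illustration (the requirement identifier below is hypothetical; the actual prompts used are those listed in Table 1), a traceability query in this setup might read:

Given the requirements in the cleaned THEMAS document above, list the identifiers of all requirements that are related to or dependent on requirement THEMAS-12.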
The RE field has often adopted IR’s measures, i.e., precision, recall, and the F-
measure, to assess the effectiveness of any tool [2]. Since both tasks selected and
defined for our study are related to requirements IR, we also used precision, recall,
F-Score, and accuracy to measure the performance of the model in performing
both the RE tasks using the five prompt patterns. These measures were computed
programmatically by comparing the model’s outputs with the ground truth.
Consider a scenario where an analyst wants to identify all NFRs in a specifica-
tion. In this scenario, a high recall indicates that the majority of the actual NFRs
were correctly identified as NFRs. Conversely, a low recall suggests that many
NFRs were missed, being mistakenly classified as FRs. A high precision signifies
that most of the requirements classified as NFRs by the LLM are indeed NFRs. On
the contrary, a lower precision suggests that a number of requirements identified as
NFRs by the LLM are, in fact, FRs.
Consider a scenario where a requirement is marked as “deprecated.” In this
scenario, it is important to trace all affected dependencies, i.e., find all requirements
associated with the deprecated requirement. In this context, a high recall would
signify that the majority of the associated or dependent requirements have been
appropriately identified. Conversely, a low recall would indicate that only a limited
number of the connections have been recognized. Precision provides a measure of
the accuracy of retrieved links in relation to the dataset. A high precision indicates
that a significant portion of the retrieved results indeed align with the dataset’s true
links. On the other hand, a low precision implies that a considerable number of the
retrieved results are not truly linked and should not have been flagged.
The F-score is the harmonic mean of precision and recall that takes both false
positives and false negatives into account. It combines both of them into a single
metric to provide a balanced evaluation of a model’s performance.
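As a sketch of how this programmatic comparison can be carried out (an illustrative reimplementation, not the study's actual script; NF is treated here as the positive class of the classification task):

def binary_metrics(predictions, ground_truth, positive="NF"):
    # Confusion-matrix counts for the chosen positive class
    tp = sum(p == positive and t == positive for p, t in zip(predictions, ground_truth))
    fp = sum(p == positive and t != positive for p, t in zip(predictions, ground_truth))
    fn = sum(p != positive and t == positive for p, t in zip(predictions, ground_truth))
    tn = sum(p != positive and t != positive for p, t in zip(predictions, ground_truth))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score: harmonic mean of precision and recall, i.e., 2PR / (P + R)
    f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(ground_truth)
    return precision, recall, f_score, accuracy

# Example: binary_metrics(["NF", "F", "NF", "NF"], ["NF", "NF", "NF", "F"])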
While this study provides an analysis of prompt patterns in the context of binary
requirements classification and tracing dependent requirements using LLMs, it is
essential to acknowledge and address potential threats to the validity of the findings.
Internal Validity There is a possibility for the existence of a degree of uncer-
tainty in the ground truth since the data in our chosen datasets are labelled by
humans (the contributing authors of the datasets). Inter-rater variability and potential
labelling bias could impact the reliability of performance metrics. The preprocessing
steps applied to the data, such as cleaning, could influence the model’s ability to cap-
ture complex patterns as well. The prompts crafted and presented in Table 1 using
the prompt patterns are also subject to the authors’ capabilities and competence with
the task at hand.
External Validity Given the unique nature of some RE tasks, the findings from
this study may not always generalize beyond binary requirements classification and
tracing dependent requirements.
Construct Validity The findings of this study may be constrained by the charac-
teristics and representativeness of the datasets in use. Since we used a GPT model,
there is a possibility that the datasets we used could have been part of the training
data for the GPT model. Next, the choice of prompt patterns is a critical aspect of
this study. The selected patterns may not fully encompass the spectrum of possible
patterns, potentially leading to an incomplete representation of LLM performance
for the selected RE tasks.
This section presents the aggregated results from the described experiments. Sec-
tion 4.1 presents the performance measures of the model in performing the two RE
tasks using the five selected prompt patterns. Section 4.2 presents recommendations
for RE researchers and practitioners on which prompt patterns to use for the selected
RE tasks based on our analysis of the results obtained, answering RQ1. We abstract
our methodology and present it as a framework to evaluate the performance of any
prompt pattern for a chosen RE task in Sect. 4.3, answering RQ2.
Tables 2 and 3 present the performance measures of the model using all five
prompt patterns in the binary requirements classification task and requirements
traceability task, respectively. The red column represents the value of the metric
Table 2 Performance measures of the model using all five prompt patterns (Cognitive Verifier, Context Manager, Persona, Question Refinement, Template) in binary requirements classification, reported at three temperature settings
Table 3 Performance measures of the model using all five prompt patterns (Cognitive Verifier, Context Manager, Persona, Question Refinement, Template) in requirements traceability, reported at three temperature settings
Table 4 Standard deviation of performance measures for the five prompt patterns in binary
requirements classification task
Binary requirements classification P-STDEV R-STDEV F-STDEV A-STDEV
Cognitive verifier 1.9% 2.0% 1.7% 1.7%
Context manager 3.7% 13.5% 7.7% 2.0%
Persona 5.4% 0.1% 2.8% 3.3%
Question refinement 2.0% 2.5% 0.5% 0.8%
Template 5.3% 0.8% 3.5% 4.8%
Table 5 Standard deviation of performance measures for the five prompt patterns in requirements
traceability task
Tracing dependant requirements P-STDEV R-STDEV F-STDEV A-STDEV
Cognitive verifier 3.7% 1.0% 3.0% 1.1%
Context manager 2.5% 4.5% 2.1% 1.4%
Persona 3.8% 2.7% 3.9% 1.4%
Question refinement 1.2% 1.7% 1.1% 0.3%
Template 3.9% 3.3% 3.3% 1.5%
for the temperature setting of 1.0, the green column represents the value of the
metric for the temperature setting of 0.4, and the blue column represents the value
of the metric for the temperature setting of 0.0. Tables 4 and 5 present the standard
deviations of precision, recall, F-Score, and accuracy measures across the three
temperature settings, denoted by P-STDEV, R-STDEV, F-STDEV, and A-STDEV,
respectively.
The precision, recall, and F1 scores, which serve as crucial indicators of the
model’s performance, consistently exhibit higher values in the binary requirements
classification task as compared to the tracing of dependent-requirements task.
Observations from the results presented above underscore the model’s ability
to discern and classify binary requirements effectively, demonstrating a higher
precision in isolating relevant instances and a better recall in identifying all pertinent
cases. A major deviation from this is the higher accuracy scores the model achieved
for all patterns in the requirements traceability task in comparison to the binary
requirements classification task. This rather significant deviation is likely the result
of the traceability task being dominated by true negatives: the model is more adept
at correctly predicting that no trace link exists between two requirements within the
SRS documents than at identifying the true positives, and correctly predicting these
non-existing links inflates the accuracy scores. This is
why we focus our analysis and base our observations more on precision, recall, and
F-score measures than accuracy.
Cognitive Verifier Pattern From Table 2, we can see that the recall is higher than
precision when implementing this pattern for binary requirements classification.
This holds true for all three temperature settings. This means that, when using the
Cognitive Verifier pattern, the model is observed to be more adept at finding the
actual NFRs (high recall, i.e., few NFRs missed as false negatives) than at ensuring
that the requirements it categorizes as NFRs are indeed NFRs (lower precision, i.e.,
more FRs flagged as false positives).
The variability in standard deviation scores, as seen in Table 4, suggests that
the effectiveness of these patterns is sensitive to the temperature setting. A higher
standard deviation score indicates lower dependability in the model’s classification
results at varying temperature settings. The standard deviation scores of precision,
recall, F-score, and accuracy are no more than two percentage points for this
pattern. This means that the effect of temperature on the model’s performance is low
when using this pattern, indicating a higher dependability in binary requirements
classification.
When it comes to the requirements traceability task, this pattern yielded higher
precision, recall, F-score, and accuracy at higher temperature settings, as seen in
Table 3. The standard deviation of recall and accuracy is observed to be lower
compared to the standard deviation of precision and F-score as observed in Table 5.
This indicates this pattern is not as good at performing requirements traceability as
it is at binary requirements classification.
Context Manager Pattern From Table 2, we observe that precision increases by
6% between low- and high-temperature settings, while recall sees a significant
drop from 73% to 49%. The F-score and accuracy also drop with an increase
in temperature but not as significantly as recall. The standard deviation scores of
precision, recall, F-score, and accuracy also vary significantly, as seen in Tables 4
and 5. These inconsistent results and high variability across standard deviation
measures indicate that this pattern is not a very dependable pattern to use when
performing binary requirements classification.
When looking at the requirements traceability task, this pattern yielded the lowest
precision, recall, and F-score values among all the patterns. Only the accuracy scores
are somewhat closer to the accuracy values of other patterns. Therefore, this pattern
might not be a suitable choice for performing requirements traceability either.
Persona Pattern The performance of the model when using the Persona pattern is
better than the Context Manager pattern, but not as good as the Cognitive Verifier
pattern. It has a higher precision value compared to the Cognitive Verifier, but the
standard deviation of precision values is almost three times that of the Cognitive
Verifier pattern. This suggests that when the temperature is not known or adjustable,
the Persona pattern might not be the most dependable if precision is more important.
A major observation is the almost negligible effect the temperature setting has on
the recall, standing at 0.1%. This means the Persona pattern has the ability to make
the model predict true positives with the same level of accuracy with little to no
impact from varying the model’s temperature.
The Persona pattern yielded the highest recall scores compared to the other
patterns when it comes to requirements traceability task. The standard deviation
of precision, recall, and F-score is between 3% and 4%, which is the second highest
among the chosen patterns. The Persona pattern does not seem to be a great choice
when performing requirements tracing, unless the focus is solely on achieving a
high recall. Even then, we recommend a lower model temperature setting to achieve
the best results.
Question Refinement Pattern The Question Refinement Pattern yielded the high-
est average precision scores in comparison to other patterns. The recall, F-score,
and accuracy are slightly lower than the Cognitive Verifier pattern but also slightly
higher than the other patterns. The standard deviation of precision and recall are
around 2% and 2.5%, but for F-score and accuracy, they are under 1.0%. This
means this pattern can be considered a dependable pattern to use when performing
binary requirements classification with a statistically insignificant effect of varying
the model’s temperature.
Even in the requirements traceability task, this pattern yielded the highest average
precision, F-score, and accuracy values in comparison to other patterns. It also
achieved the lowest standard deviation scores for all the performance metrics,
making it the top choice for implementation for requirements traceability task.
Template Pattern Finally, the Template pattern has the highest
observed recall measures across all three temperature settings among all patterns
tested. The standard deviation of recall is also very low, indicating the model can
predict true positives with the same level of accuracy with little to no impact
from varying the model’s temperature. However, the precision and accuracy are
below 70% at default and higher-temperature settings. It has a significant amount
of standard deviation for precision, F-score, and accuracy across temperatures. This
makes us question the pattern’s ability to make the model yield consistent results
when performing binary requirements classification.
The template pattern does not show any noteworthy improvement in results when
it comes to the requirements traceability task. The standard deviation scores of
precision, recall, and F-score are all above 3%, with none of the performance metrics
achieving higher scores in comparison to other patterns.
4.2 Recommendations
Based on our analysis of the results presented in Table 2, we can say that the
Cognitive Verifier pattern and the Question Refinement pattern are better suited
for the binary requirements classification task at any temperature setting compared
to the Context Manager, the Template, and the Persona patterns.
Similarly, we can say that the Question Refinement pattern is the most
consistent and reliable pattern among the five for tracing dependent requirements
followed by the Cognitive Verifier and the Persona pattern. The Context Manager
pattern and the Template pattern are the least reliable patterns for this task (Table 6).
Overall, the Question Refinement pattern shows consistent results across both
the classification and requirements traceability tasks. The Cognitive verifier pattern
and the Persona pattern obtained higher performance scores in binary requirements
classification, although their performance in tracing dependent requirements was
Table 6 Rank based prompt pattern recommendation for overall and individual tasks
Rank Binary classification Tracing Overall
1st Question Refinement Question Refinement Question Refinement
2nd Cognitive Verifier Cognitive Verifier Cognitive Verifier
3rd Persona Persona Persona
4th Template Template Template
5th Context Manager Context Manager Context Manager
reduced. The Context Manager pattern was found to have a greater degree
of variability in its STDEV measures for both tasks. Our results indicate this
pattern may not be the best-suited pattern for performing the selected requirements
engineering tasks.
In order to evaluate the effectiveness of prompt patterns for any RE task, we propose
adopting a framework similar to the methodology used in this chapter.
– Step-1: Curate a dataset for the task in question, comprising two distinct
versions. The first version should contain the ground-truth annotations, while
the second version should be cleaned to remove any identifiers that helped in
establishing the ground truth. This version will serve as input data provided to
the GenAI model through a program script.
– Step-2: Create a program script1 that mimics the RE task’s underlying logic.
The script should be designed to leverage the capabilities of the GenAI model
via an API call function. Clearly specify the desired output format within the
prompt embedded in the code script. Ensure that the script is capable of taking
the second version of the dataset as input and generating results using the GenAI
model.
– Step-3: Execute the code script created in the previous step to generate results.
The script should make API calls to the GenAI model, using the specified prompt
pattern.
– Step-4: Conduct a comparative analysis of the obtained results against the
ground-truth annotations (from the first version of the dataset). This analysis will
provide insights into how well the GenAI model performed in relation to the
ground truth.
– Step-5: Use the comparative assessment results to evaluate the effectiveness of
the prompt pattern(s) in the context of the specific RE task.
1 https://github.com/beatrizcabdan/GenAI4REtasks
By analyzing the requirements and objectives of the task, one can determine
whether the nature of the task leans more toward binary classification or require-
ments traceability. Based on this, our recommendations for which prompt patterns
to apply may hold. However, the results of this study may not consistently apply to
other situations. When confronted with a novel RE task that cannot be framed as a
requirements IR task, it is advisable to experiment with the patterns and evaluate the
effectiveness of patterns in the context of the specific task and dataset as suggested
in our evaluation framework.
5 Discussion
The results of our research study shed light on the effectiveness of different prompt
patterns in the context of RE tasks, specifically focusing on binary requirements
classification and requirements traceability. In the following, we will delve into the
implications of these findings and their practical applications in real-world practice.
Our study revealed that the Cognitive Verifier and Question Refinement
patterns achieved the best results in binary requirements classification. These
patterns provide a reliable and consistent approach to achieving accurate and reliable
classifications. In real-world RE practice, RE practitioners can consider adopting
either the Cognitive Verifier or Question Refinement pattern for tasks that involve
binary classification. For instance, when evaluating software requirements for
compliance with specified standards, these patterns could be used to streamline
the classification process, reducing manual effort and potential errors. The Persona
pattern seems to exhibit better results at lower-temperature settings compared to
higher-temperature settings, indicating it is better suited to classification tasks where
less creative and more definitive responses are required.
In the case of requirements traceability, our findings indicate that the Question
Refinement pattern outperforms others. This suggests that when the RE task
involves establishing relationships and dependencies between various requirements,
using the Question Refinement pattern is the most effective option. For RE teams
tasked with tracing dependencies among requirements, the Question Refinement
pattern has significantly better performance compared to other patterns. This is
particularly valuable in complex projects where understanding how changes in one
requirement may impact others is critical.
Since the performance measures were calculated automatically using the depen-
dencies provided in the datasets, we did not manually investigate the requirements
that were misidentified as dependent (or independent). Further work, particularly
work focusing specifically on traceability, should investigate more closely the
reasons behind the misclassification. The sensitivity of the prompt used to generate
some artefacts using LLMs or any other GenAI model opens up possibilities where
minimal changes to the prompt can result in significant differences in the quality
of the output as well as the performance of the model. This experimental study
was limited to examining the effect of pattern-level variations on the prompts used
and did not look into the specific wording of the prompts, in order to keep the
experimental results tractable. It is equally important to remember that using LLMs
for RE tasks should be limited to assisting relevant RE stakeholders with appropriate
human oversight mechanisms in place instead of automating these tasks. Therefore,
the usefulness of LLMs, and subsequently, prompt patterns that are used to craft
the prompts to interact with the LLMs, comes with limitations and needs more
dedicated research to be discussed in depth.
While our study provides insights into the effectiveness of specific patterns, it is
essential to acknowledge the unique nature of various RE tasks. Not all tasks can
be framed as requirements IR tasks. Therefore, organizations should consider an
analysis of their specific RE requirements and objectives. When confronted with
novel RE tasks, teams can follow a structured approach similar to our proposed
framework presented in Fig. 1 to identify the most suitable prompt pattern. By
breaking down the workflow and considering the task’s nature, they can adapt and
experiment with different patterns to optimize results. This iterative process allows
for continuous improvement in the choice of patterns for specific tasks.
6 Conclusion
In conclusion, the insights garnered from this study offer guidance for practitioners
seeking to leverage prompt patterns for using GenAI in RE tasks. Our research offers
recommendations on the selection and adoption of prompt patterns for real-world
RE tasks. We suggest that the Question Refinement pattern might serve as a suitable
compromise for both tasks. Moreover, the chapter presents an evaluation framework
based on the methodology used in our study on how one might evaluate and decide
which prompt pattern could be the most effective for a new RE task, one that
considers the trade-offs between precision, recall, and accuracy. Practitioners/other
researchers can use this framework as a guideline for assessing the suitability of
prompt patterns for their unique RE tasks. By understanding the strengths and
limitations of different patterns and employing a structured evaluation framework,
organizations can enhance the efficiency and accuracy of their RE processes,
ultimately leading to improved software development outcomes and project success.
The insights presented can lay the foundation for several avenues of future
research, aiming to deepen our understanding of prompt patterns and further
enhance the performance of GenAI in RE. Future investigations could delve into
a more exhaustive exploration of prompt patterns, potentially identifying novel
approaches. An ensemble approach, combining the merits of different patterns,
may mitigate the limitations associated with individual patterns and contribute to
a more robust and adaptable classification framework. Future research could focus
on optimizing the balance between precision and recall; exploring strategies that
prioritize comprehensive recall without compromising precision or accuracy, which
could lead to more context-aware and adaptable models; and synthesizing new
prompt patterns and evaluation frameworks. Another avenue of further research is to establish the boundary of
LLMs’ application in RE activities with appropriate human oversight mechanisms
in place to ensure the ethical and responsible application of these technologies.
Acknowledgments This work was supported by the Vinnova project ASPECT [2021-04347].
References
1. Alhoshan, W., Ferrari, A., Zhao, L.: Zero-shot learning for requirements classification: an
exploratory study. Inf. Software Technol. 159, 107202 (2023)
2. Berry, D.M., Cleland-Huang, J., Ferrari, A., Maalej, W., Mylopoulos, J., Zowghi, D.: Panel:
context-dependent evaluation of tools for NL RE tasks: recall vs. precision, and beyond.
In: 2017 IEEE 25th International Requirements Engineering Conference (RE), pp. 570–573
(2017). https://doi.org/10.1109/RE.2017.64
3. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Adv. Neural
Inf. Process. Syst. 33, 1877–1901 (2020)
4. Cheng, Y., Chen, J., Huang, Q., Xing, Z., Xu, X., Lu, Q.: Prompt Sapper: A LLM-Empowered
Production Tool for Building AI Chains. Preprint (2023). arXiv:2306.12028
5. Cleland-Huang, J., Mazrouee, S., Liguo, H., Port, D.: NFR (2007). https://doi.org/10.5281/
zenodo.268542
6. Ferrari, A., Spagnolo, G.O., Gnesi, S.: PURE: a dataset of public requirements documents.
In: 2017 IEEE 25th International Requirements Engineering Conference (RE), pp. 502–505
(2017). https://doi.org/10.1109/RE.2017.29
7. Fiannaca, A.J., Kulkarni, C., Cai, C.J., Terry, M.: Programming without a programming
language: challenges and opportunities for designing developer tools for prompt programming.
In: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems,
CHI EA ’23. Association for Computing Machinery, New York, NY, USA (2023). https://doi.
org/10.1145/3544549.3585737
8. Haque, M.U., Dharmadasa, I., Sworna, Z.T., Rajapakse, R.N., Ahmad, H.: “I think this is the
most disruptive technology”: Exploring Sentiments of ChatGPT Early Adopters Using Twitter
Data. Preprint (2022). arXiv:2212.05856
9. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by
between-class attribute transfer. In: 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 951–958. IEEE (2009)
10. Liu, V., Chilton, L.B.: Design guidelines for prompt engineering text-to-image generative
models. In: Proceedings of the 2022 CHI Conference on Human Factors in Computing
Systems, CHI ’22. Association for Computing Machinery, New York, NY, USA (2022). https://
doi.org/10.1145/3491102.3501825
11. Lu, Y., Bartolo, M., Moore, A., Riedel, S., Stenetorp, P.: Fantastically Ordered Prompts
and Where to Find Them: Overcoming Few-shot Prompt Order Sensitivity. Preprint (2021).
arXiv:2104.08786
12. Nguyen, N., Nadi, S.: An empirical evaluation of GitHub copilot’s code suggestions. In:
Proceedings of the 19th International Conference on Mining Software Repositories, MSR ’22,
pp. 1–5. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/
10.1145/3524842.3528470
13. Peng, S., Kalliamvakou, E., Cihon, P., Demirer, M.: The Impact of AI on Developer
Productivity: Evidence from GitHub Copilot. Preprint (2023). arXiv:2302.06590
14. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., others: Language models
are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
15. Reynolds, L., McDonell, K.: Prompt programming for large language models: beyond the
few-shot paradigm. In: Extended Abstracts of the 2021 CHI Conference on Human Factors
in Computing Systems, CHI EA ’21. Association for Computing Machinery, New York, NY,
USA (2021). https://doi.org/10.1145/3411763.3451760
16. Wei, J., Wang, X., Schuurmans, D., Bosma, M., ichter, b., Xia, F., Chi, E., Le, Q.V.,
Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In:
Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances
in Neural Information Processing Systems. vol. 35, pp. 24824–24837. Curran Asso-
ciates, Inc., New York (2022). https://proceedings.neurips.cc/paper_files/paper/2022/file/
9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
17. White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith,
J., Schmidt, D.C.: A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT
(2023). https://doi.org/10.48550/arXiv.2302.11382
18. White, J., Hays, S., Fu, Q., Spencer-Smith, J., Schmidt, D.C.: ChatGPT Prompt Patterns for
Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design (2023).
https://doi.org/10.48550/arXiv.2303.07839
19. Xu, C., Li, Y., Wang, B., Dong, S.: A systematic mapping study on machine learning
methodologies for requirements management. IET Software 17(4), 405–423 (2023). https://
doi.org/10.1049/sfw2.12082
20. Zhang, J., Chen, Y., Niu, N., Wang, Y., Liu, C.: Empirical Evaluation of ChatGPT on
Requirements Information Retrieval Under Zero-Shot Setting (2023). https://doi.org/10.48550/
arXiv.2304.12562
Advancing Requirements Engineering
Through Generative AI: Assessing the
Role of LLMs
1 Introduction
Requirements Engineering (RE) is arguably the most critical task in the software
development process, where the needs and constraints of a system are identified,
analyzed, and documented to create a well-defined set of requirements [19]. Orga-
nizations and project teams often overlook or do not understand the significance
of RE and its impact on project success [24]. Some underlying reasons for the
lack of effort and resources spent in RE include (i) time, budget, and resource
constraints; (ii) inadequate training and skills; (iii) uncertainty and ambiguity in
early stages, which teams consider as challenging, causing them to cut corners in the
RE process; (iv) inadequate tools and automation support [5]; and (v) emphasis on
an implementation-first approach instead [14]. These lead to significant challenges
in the later stages of development as issues related to inconsistent, incomplete,
and incorrect requirements become increasingly difficult to resolve, resulting in
increased development costs, delays, and lower-quality software systems [19].
In this chapter, we contend that the recent advances in large language models
(LLMs) [13] might be revolutionary in addressing many of these RE-related
challenges noted above, though with some caveats. LLMs are advanced AI models
designed to process and generate human language by learning patterns and struc-
tures from vast amounts of text data. These models have made significant strides in
natural language processing (NLP) tasks and are particularly adept at handling com-
plex language-based challenges. LLMs, including OpenAI’s Generative Pre-trained
Transformer (GPT) series and Google’s Bidirectional Encoder Representations from
Transformers (BERT) [8] and LaMDA [23], learn to comprehend and generate
human language by predicting the most probable next word in a given sequence,
capturing the probability distribution of word sequences in natural language (NL).
OpenAI’s ChatGPT1 and Google’s Bard2, built on the advancements of the LLMs,
are examples of chatbot platforms designed to facilitate interactive and dynamic
text-based conversations. When a user provides input to ChatGPT or Bard, the
model processes the text and generates a contextually appropriate response based
on the patterns learned during the training process.
A large majority of requirements are specified using NL. LLMs thus have the
potential to be a “game-changer” in the field of RE. This could be by automating and
streamlining several crucial tasks and helping to address many of the RE challenges
mentioned earlier. With the focus on automated code generation using LLMs,
delivering concise and consistently unambiguous specifications to these models
(as prompts) becomes paramount. This underscores the ever-growing significance
of RE in this new era of generative AI-driven software engineering. This chapter
explores the potential of LLMs to transform the RE processes. We present a SWOT
(strengths, weaknesses, opportunities, and threats) analysis for applying LLMs in
all key RE stages, including requirements elicitation, analysis, and specification.
We also discuss examples from a preliminary evaluation as motivation for using
LLMs in all RE stages.
Preliminary Evaluation Context We performed a preliminary evaluation on a
real-world app (pseudonym ActApp), encouraging patients with type-2 diabetes
(T2D) to remain active. To ensure that the app is effective, engaging, and per-
sonalized, the ActApp team implemented a machine learning (ML) model in the
background to learn from user behavior and preferences and suggest appropriate
1 https://chat.openai.com/.
2 https://bard.google.com/.
reminders and activities. The team has a mix of experienced engineers and an
ML scientist (with little understanding of RE). Our preliminary evaluation and the
examples in the chapter are done using ChatGPT (GPT-3.5).
Structure Section 2 provides an overview of our vision of the role of LLMs
in RE process. Sections 3, 4, 5, and 6 cover the four major RE stages, i.e.,
elicitation, specification, analysis, and validation, respectively. Section 7 presents
our preliminary evaluation results. Section 8 covers the lessons learned, and Sect. 9
concludes the chapter.
2 LLMs-Driven RE Process
3 Requirements Elicitation
LLMs can address numerous key challenges in the elicitation phase, including
domain analysis. LLMs can rapidly absorb vast amounts of domain-specific lit-
erature, providing a foundational structuring and acting as a proxy for domain
knowledge source [15]. They can assist in drawing connections, identifying gaps,
and offering insights based on the existing literature and based on automated tasks
such as as-is analysis, domain analysis, and regulatory compliance. In addition to
stakeholder communication, leveraging LLMs would require other inputs such as
existing domain or project-specific documentation (e.g., fine-tuning LLMs) and
regulations (e.g., GDPR). While LLMs have access to the domain knowledge, it
is difficult to replace domain specialists’ intuition, experience, and expertise. For
example, in ActApp, the nuanced understanding of how specific exercises influence
a patient’s glucose or hormonal levels rests with medical professionals such as
endocrinologists, who are irreplaceable in RE.
LLMs help identify unknowns by analyzing existing documentation and high-
lighting areas of ambiguity or uncertainty. LLMs can help complete requirements
or suggest alternative ideas that the requirements analysts might have otherwise
missed, drawing on their large corpus of training data and connections. LLMs
can assist with translating complex technical jargon into plain language and
aiding stakeholders from different linguistic backgrounds, e.g., translating medical
terminology in ActApp for requirements analysts or translating domain information
from one language to another.
LLMs play a vital role in human-centric RE. They can analyze diverse user
feedback, like app reviews, ensuring all user needs are addressed. LLMs can
also simulate user journeys considering human-centric factors, but this necessitates
resources such as app reviews, persona-based use cases, and accessibility guidelines.
For emerging technologies, LLMs need regular updates, a challenging task since
automated solutions might be affected by these updates. The use of LLMs in
requirements elicitation also warrants ethical scrutiny. LLMs may introduce or
perpetuate biases as they are trained on vast internet data. Ensuring the ethical use
of LLMs means avoiding biases and guaranteeing that the stakeholders’ inputs are
managed according to the data privacy and security guidelines. LLMs output should
be viewed as complementary to human efforts. Requirements analysts bring domain
expertise, cultural awareness, nuanced understanding, and empathetic interactions
to the table, ensuring that software requirements cater to the diverse and evolving
needs of end users. This synergy of humans and generative AI is crucial in human-
centric software development.
Example For the ActApp, LLMs are used to gather information from various
stakeholders, including patients and carers. The agent can conduct virtual interviews
with the stakeholders (for a given persona, as exemplified below), asking targeted
questions to identify their needs, preferences, and concerns. For instance, the agent
may inquire users about desired features and data privacy concerns. Additionally,
LLMs can analyze and synthesize information from online forums, social media,
reviews from similar apps, and research articles on disease management to extract
insights into common challenges patients face and best practices for care. This
information can generate preliminary requirements (e.g., R1 and R2 below), which
can be refined in later stages.
4 Requirements Specification
LLMs can streamline the specification process. The unstructured requirements from
the elicitation stage can be automatically formatted into structured templates like
EARS or user stories (see the example prompt below for EARS and the example for
user stories). They can further assist in categorizing requirements into functional
and non-functional and classifying NFRs like performance, ethical requirements,
and usability. LLMs can automate other tasks during specification, e.g., generating
glossaries, rationales, and examples, and developing personas [29]. Another advantage
of LLMs is their ability to cross-check requirements against existing standards,
regulatory guidelines, or best practices. For a health-focused app like ActApp,
LLMs can ensure alignment with health data privacy standards and medical device
directives.
LLMs can also suggest requirements prioritization by analyzing technical depen-
dencies, project goals, and historical data. However, generating requirements
prioritization requires input from several SE roles and deep-seated expertise. Hence, the results
produced by LLMs might be inaccurate. Along similar lines, while LLMs can enhance
the speed and consistency of specification, there is a risk of “over-automation,” i.e.,
overlooking some crucial aspects or over-trusting the requirements produced by
LLMs. For instance, determining the criticality of specific NFRs—like how secure
a system needs to be or how scalable—often requires human expertise. LLMs can
aid the process, but decisions should be validated by domain experts familiar with
the project context. Similarly, for compliance issues, it is essential to have domain
experts validate the results.
Example Prompt. Using the EARS template defined by the BNF grammar below,
generate the <requirement> from the unformatted requirement - “The patients
should not receive notifications when busy.”
<requirement> ::= <ubiquitous> | <event-driven> | <state-driven> | <optional> | <unwanted>
<ubiquitous> ::= “The system shall <action>.”
<event-driven> ::= “When <event>, the system shall <action>.”
<state-driven> ::= “While <state>, the system shall <action>.”
<optional> ::= “Where <feature is included>, the system shall <action>.”
<unwanted> ::= “The system shall <preventive-action> to <unwanted-outcome>.”
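As a complementary illustration, conformance to these templates can also be checked programmatically once the LLM has produced a formatted requirement. The following is a minimal sketch using regular expressions loosely derived from the grammar above; the patterns, the function name, and the example requirement are our own illustrative assumptions, not an official EARS checker.

```python
# A minimal sketch of checking conformance to the EARS templates above; the regex
# patterns, function name, and example requirement are illustrative assumptions.
import re

EARS_PATTERNS = {
    "event-driven": re.compile(r"^When .+, the system shall .+\.$"),
    "state-driven": re.compile(r"^While .+, the system shall .+\.$"),
    "optional":     re.compile(r"^Where .+, the system shall .+\.$"),
    "unwanted":     re.compile(r"^The system shall .+ to .+\.$"),
    "ubiquitous":   re.compile(r"^The system shall .+\.$"),  # checked last: most general
}

def classify_ears(requirement: str) -> str:
    """Return the first EARS template the requirement matches, or 'unformatted'."""
    for template, pattern in EARS_PATTERNS.items():
        if pattern.match(requirement.strip()):
            return template
    return "unformatted"

# An LLM-formatted version of the unformatted ActApp requirement in the prompt above:
print(classify_ears("While the patient is busy, the system shall suppress notifications."))
# -> 'state-driven'
```

In practice, such a lightweight check complements, rather than replaces, the analyst's review of the LLM-formatted requirement.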
Example In ActApp, the LLMs can generate refined requirements as user stories
(desired by ActApp team members). The requirements document may include
sections such as an introduction, a description of ActApp stakeholders, a list of
functional and non-functional requirements, a list of ActApp features with priorities,
and any constraints or assumptions related to the development process. For non-
functional requirements, such as data privacy for patients’ health information,
LLMs can cross-reference with regulations, e.g., HIPAA or GDPR, to ensure
compliance [1].
5 Requirements Analysis
LLMs come into play as powerful tools to automate the quality evaluation process:
1. Automated Evaluation for Quality Assurance: LLMs can automatically assess
the quality of requirements, flagging any ambiguities, vague terms, inconsisten-
cies, or incompleteness, and highlight gaps or overlaps.
2. Risk Identification and Countermeasure Proposal: LLMs, when equipped
with domain knowledge, can identify potential risks associated with require-
ments or their underlying assumptions. Drawing from historical data or known
risk patterns, LLMs can suggest new requirements that act as countermeasures
to mitigate those risks, ensuring system design and operation robustness.
3. Conflict Resolution and Negotiation: By identifying areas of contention, LLMs
can facilitate the negotiation process. Multiple LLM agents can be employed to
negotiate the requirements, suggest compromises, and simulate various scenar-
ios, helping stakeholders converge on a unified set of requirements.
4. Human-centric Requirements Enhancement: LLMs can evaluate require-
ments to ensure they cater to diverse user needs, accessibility standards, and
user experience guidelines. LLMs can also suggest requirements that enhance
the software’s usability or accessibility based on user personas or feedback.
Moreover, they can evaluate requirements for biases or potential ethical concerns,
ensuring that the software solution is inclusive and ethically sound.
Example Prompt.
Context: For the ActApp system, we need to negotiate and prioritize requirements
(FR1-FR7 and NFR1-NFR5) that ensure the system caters to the patient’s health
needs while maintaining usability and data privacy.
Task: Create two agents: Agent1 (A1) represents the primary user (a T2D patient).
Agent2 (A2) represents the system’s software architect. A1 and A2 will negotiate
and discuss FR1 - FR7 to determine a priority list. During this negotiation, A1 will
focus on the user experience, health benefits, and practical needs, while A2 will
consider technical feasibility, integration with existing systems, and the architec-
tural perspective. The agents can sometimes have differing opinions, leading to a
more nuanced and realistic discussion. No decisions should violate NFR1 - NFR5.
Expected Output Format: FRs in decreasing order of priority, and include the
rationale for priority order based on the negotiation outcomes between A1 and A2.
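The negotiation described in this prompt can also be scripted. The following is a minimal sketch assuming access to an OpenAI-compatible chat API; the model name, the persona texts, and the placeholder requirement strings are illustrative assumptions rather than the setup used by the authors.

```python
# A minimal sketch of the two-agent negotiation above, assuming access to an
# OpenAI-compatible chat API. The model name, persona texts, and placeholder
# requirement strings are illustrative assumptions, not the authors' exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

personas = {
    "A1": "You are a T2D patient. Prioritize user experience, health benefits, and practical needs.",
    "A2": "You are the software architect. Prioritize feasibility, integration, and architecture.",
}
requirements = "FR1: ...; FR2: ...; FR3: ..."  # placeholder for FR1-FR7 of ActApp

transcript = []
for _ in range(2):  # a fixed number of negotiation rounds
    for name, persona in personas.items():
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": persona + " Never violate NFR1-NFR5."},
                {"role": "user", "content": (
                    f"Requirements: {requirements}\n"
                    f"Discussion so far: {transcript}\n"
                    "Give your current priority list for FR1-FR7 with a short rationale."
                )},
            ],
        ).choices[0].message.content
        transcript.append(f"{name}: {reply}")

print("\n\n".join(transcript))  # the analyst reviews the final agreed ordering
```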
Example In the context of ActApp, LLMs can (i) identify and resolve ambiguities
or inconsistencies in the requirements, such as conflicting preferences of patients or
unclear feature descriptions; (ii) highlight any dependencies and requisites, e.g., a
secure data storage system to support medical data storage; and (iii) generate missed
ethical and regulatory concerns related to data storage.
6 Requirements Validation
LLMs can assist in the validation phase in several nuanced ways. As highlighted in
the Analysis phase, LLMs can aid in the manual review and inspections by flagging
potential ambiguities, inconsistencies, or violations based on predefined validation
heuristics. LLMs can be utilized to simulate stakeholder perspectives, enabling
analysts to anticipate potential misinterpretations or misalignments. For instance, by
analyzing historical stakeholder feedback, LLMs can predict potential areas where
clarifications might be sought from the perspective of a given stakeholder. With their
ability to process vast amounts of data quickly, LLMs can assist in requirements
traceability to other artefacts, e.g., design documents and regulatory codes. LLMs
can further assist in formulating clear and precise acceptance criteria based on
the documented requirements. They can also propose test scenarios, ensuring a
comprehensive validation suite. Furthermore, LLMs can scan the requirements to
identify and flag any overlooked human-centric aspects, constraints, or dependen-
cies, ensuring a more comprehensive validation. While LLMs can facilitate most
validation tasks, as noted above, a major weakness of LLMs in this context is that
the validation tasks often require an overall picture of the project, domain, and
stakeholders’ viewpoints—it is extremely difficult for LLMs to work at that level
of abstraction, which typically requires manual effort from numerous stakeholders.
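One way to approximate the traceability support mentioned above is to rank candidate trace links by embedding similarity and leave the final decision to the analyst. The sketch below is a simplified illustration using the sentence-transformers library; the model name, the similarity threshold, and the requirement and design texts are our own assumptions.

```python
# A simplified sketch of embedding-based trace-link suggestion, used here as a proxy
# for the LLM-assisted traceability described above. The model name, threshold, and
# requirement/design texts are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

requirements = {
    "FR3": "The system shall encrypt patients' glucose readings at rest.",
}
design_items = {
    "D12": "The storage service encrypts health records using AES-256.",
    "D20": "The notification scheduler batches reminders to reduce battery use.",
}

req_emb = model.encode(list(requirements.values()), convert_to_tensor=True)
des_emb = model.encode(list(design_items.values()), convert_to_tensor=True)
scores = util.cos_sim(req_emb, des_emb)  # one row per requirement, one column per design item

for i, rid in enumerate(requirements):
    for j, did in enumerate(design_items):
        similarity = scores[i][j].item()
        if similarity > 0.5:  # candidate trace link, to be confirmed by the analyst
            print(f"{rid} -> {did} (similarity={similarity:.2f})")
```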
Example Prompt.
Context: For the ActApp system, we need to perform the validation on all the
requirements specified in the system (FR1 - FR50) and (NFR1 - NFR28). The
goal is to identify the gaps in all the requirements from three different stakeholders’
perspectives, the software developer, the ML scientist and the product owner.
Task: Imagine all three stakeholders are answering this question. For each require-
ment, all stakeholders will write down the gaps in the requirement based on their
role, and then share it with the group. Then all stakeholders will review the inputs
in the group and move to the next step. If any expert has not identified a gap
or a concern, they can skip the discussion on that requirement.
Expected Output Format: For all gaps agreed upon by all stakeholders, export the
issue with the requirement id.
Example In ActApp, LLMs can generate acceptance criteria. Also, LLMs can
uncover gaps: in our preliminary LLM evaluation, the ActApp team realized it
needed to comply with Australia’s Therapeutic Goods Act (TGA) regulation.
7 Preliminary Evaluation
requirements, it is important to note that with a project brief and one interaction
session, the experienced participants could get almost half the relevant requirements,
emphasizing the feasibility of LLMs for RE.
8 Lessons Learned
9 Conclusion
References
1. Abualhaija, S., Arora, C., Sleimi, A., et al.: Automated question answering for improved under-
standing of compliance requirements: A multi-document study. In: International Requirements
Engineering Conference (RE’22) (2022)
2. Ahmad, K., Abdelrazek, M., Arora, C., et al.: Requirements engineering framework for human-
centered artificial intelligence software systems. Appl. Soft Comput. 143, 110455 (2023a)
3. Ahmad, K., Abdelrazek, M., Arora, C., et al.: Requirements engineering for artificial intelli-
gence systems: A systematic mapping study. Inf. Software Technol. 158, 107176 (2023b)
4. Akter, S., McCarthy, G., Sajib, S., et al.: Algorithmic bias in data-driven innovation in the age
of AI. Int. J. Inf. Manag. 60, 102387 (2021)
5. Arora, C., Sabetzadeh, M., Briand, L., et al.: Automated checking of conformance to require-
ments templates using natural language processing. IEEE Trans. Software Eng. (TSE’15)
41(10), 944–968 (2015)
6. Arora, C., Sabetzadeh, M., Briand, L., et al.: Automated extraction and clustering of require-
ments glossary terms. IEEE Trans. Software Eng. 43(10), 918–945 (2017)
7. Borji, A.: A categorical archive of chatgpt failures. Preprint (2023). arXiv:230203494
8. Devlin, J., Chang, M.W., Lee, K., et al.: BERT: Pre-training of deep bidirectional transformers
for language understanding (2018). arXiv:181004805
9. Ezzini, S., Abualhaija, S., Arora, C., et al.: Automated handling of anaphoric ambiguity in
requirements: A multi-solution study. In: 2022 IEEE/ACM 44th International Conference on
Software Engineering (2022)
10. Gorschek, T., Wohlin, C.: Requirements abstraction model. Requir. Eng. 11, 79–101 (2006)
11. Hariri, W.: Unlocking the potential of chatgpt: A comprehensive exploration of its applications,
advantages, limitations, and future directions in natural language processing (2023). https://doi.
org/10.48550/arXiv.2304.02017
12. Hidellaarachchi, D., Grundy, J., Hoda, R., et al.: The effects of human aspects on the
requirements engineering process: A systematic literature review. IEEE Trans. Software Eng.
48(6), 2105–2127 (2022)
13. Jurafsky, D., Martin, J.H.: Speech and Language Processing, 3rd edn. (2020). https://web.
stanford.edu/~jurafsky/slp3/ (last visited 2021-06-04)
14. Laplante, P.A., Kassab, M.H.: Requirements Engineering for Software and Systems. CRC
Press, Boca Raton (2022)
15. Luitel, D., Hassani, S., Sabetzadeh, M.: Using language models for enhancing the completeness
of natural-language requirements. In: Requirements Engineering: Foundation for Software
Quality: 29th International Working Conference (2023)
16. Ma, H., Zhang, C., Bian, Y., et al.: Fairness-guided few-shot prompting for large language
models. Preprint (2023). arXiv:230313217
17. Mavin, A., Wilkinson, P., Harwood, A., et al.: Easy approach to requirements syntax (ears).
In: 2009 17th IEEE International Requirements Engineering Conference, pp. 317–322. IEEE
(2009)
18. Pan, W., Chen, Q., Xu, X., et al.: A preliminary evaluation of ChatGPT for zero-shot dialogue
understanding. Preprint (2023). arXiv:230404256
19. Pohl, K.: Requirements Engineering, 1st edn. Springer, New York (2010)
20. Robertson, J., Robertson, S.: Volere. Requirements Specification Templates (2000)
21. Sawyer, P., Sommerville, I., Viller, S.: Capturing the benefits of requirements engineering.
IEEE Software 16(2), 78–85 (1999)
22. Sutcliffe, A., Sawyer, P.: Requirements elicitation: Towards the unknown unknowns. In:
International Requirements Engineering Conference (RE’13) (2013)
23. Thoppilan, R., De Freitas, D., Hall, J., et al.: Lamda: Language models for dialog applications.
Preprint (2022). arXiv:220108239
24. van Lamsweerde, A.: Requirements Engineering: From System Goals to UML Models to
Software Specifications, 1st edn. Wiley, New York (2009)
25. Wei, J., Wang, X., Schuurmans, D., et al.: Chain of thought prompting elicits reasoning in large
language models (2022). CoRR abs/2201.11903. https://arxiv.org/abs/2201.11903
26. White, J., Fu, Q., Hays, S., et al.: A prompt pattern catalog to enhance prompt engineering with
chatgpt. Preprint (2023). arXiv:230211382
27. Yao, S., Yu, D., Zhao, J., et al.: Tree of thoughts: Deliberate problem solving with large
language models. Preprint (2023). arXiv:230510601
28. Zamani, K., Zowghi, D., Arora, C.: Machine learning in requirements engineering: A mapping
study. In: 2021 IEEE 29th International Requirements Engineering Conference Workshops
(REW), pp 116–125. IEEE (2021)
29. Zhang, X., Liu, L., Wang, Y., et al.: Personagen: A tool for generating personas from user
feedback. In: 2023 IEEE 31st International Requirements Engineering Conference (RE), pp
353–354. IEEE (2023)
30. Zhao, L., Alhoshan, W., Ferrari, A., et al.: Natural language processing for requirements
engineering: A systematic mapping study. ACM Comput. Surv. (CSUR) 54(3), 1–41 (2021)
31. Zheng, L., Chiang, W.L., Sheng, Y., et al.: Judging llm-as-a-judge with mt-bench and chatbot
arena. Preprint (2023). arXiv:230605685
Part III
Generative AI in Software Development:
Case Studies
Generative AI for Software Development:
A Family of Studies on Code Generation
1 Introduction
The key insights from the book chapter are discussed in Sect. 4. Section 5 concludes
the chapter.
2 Related Work
In SE, many efforts have aimed to automate code generation over the years. Early
efforts formulated the task of code generation as a search-based problem, wherein
the objective is to discover a program within a predefined search space of a
programming language that satisfies all constraints derived from the functionality
that the code should provide [17]. Subsequent efforts shifted toward input/output
task specifications and collected a set of expressions from the pool of possible
expressions in a programming language that transfers the input examples into output
examples [18].
The advancement of generative AI has brought a profound transformation in
developing intelligent assistant tools for automatic code generation that incorporate
LLMs. Various LLMs exhibit remarkable performance in code generation tasks in
SE. Table 1 shows a list of well-known LLMs for code generation tasks, along
with their prevalent use cases. An early undertaking on pre-training transformers
with code content is CodeBERT [16]. CodeBERT is an auto-encoding LLM based
on BERT architecture [12]. Its pre-training phase encompasses 6M (Million) code
snippets and 2M pairs of natural language and code snippets, across 6 programming
languages. Its pre-training data are all extracted from the CodeSearchNet [21]
dataset gathered from GitHub public repositories. It has found application in diverse
code classification tasks such as code review classification [55], prediction of flaky
tests [2], defect detection [39], and also in code retrieval tasks [8]. While CodeBERT
continues to stand out as a practical, open-source, and lightweight model, well suited
for a variety of tasks that require code representation [13, 64], its maximum length
of tokens limits its usage for diverse code generation tasks.
One of the very powerful LLMs is OpenAI’s Codex [9], which has been
extensively applied across various SE tasks involving code generation [29, 37].
Codex is a GPT-based autoregressive LLM [6] with up to 12B (Billion) parameters
that is fine-tuned on 54M public repositories from GitHub. OpenAI provides two
versions of Codex: code-davinci-002 and code-cushman-001. The former stands as
the more capable model, with a maximum context length of 4k, whereas the latter is
limited to 2k. Notably, Codex powers GitHub Copilot,1 an in-IDE developer coding
assistant that was released on June 29, 2021.
A new release of GitHub Copilot is “Copilot Chat,”2 which is tuned with human
feedback for dialog use cases. Copilot Chat can be employed for a wide range
of coding-related tasks. Depending on the specific context of the task, it offers
responses such as generating code snippets to implement specific functionalities,
providing natural language explanations for code segments, suggesting unit tests, or
repairing a buggy code.
Although Codex and its collaborative tool, Copilot, indicate promising perfor-
mance across a range of code generation tasks, their lack of open-source availability
renders them unsuitable for tasks that involve a tuning step.
CodeT5 operates as a sequence-to-sequence LLM [57]. CodeT5 employed code-
specific knowledge and proposed a novel identifier-aware approach as part
of its training objective. It is built upon the T5 architecture and is well suited
for tasks revolving around the generation of a new sequence in response to an
input sequence. Illustrative examples of such tasks are code translation [58],
code summarization [1], and generative question answering [61]. The updated
version of CodeT5, called CodeT5+, has been trained on a larger volume of data,
1 https://copilot.github.com/
2 https://docs.github.com/en/copilot/github-copilot-chat
One of the primary use cases of automatic code generation in the SE development
life cycle is to generate a code snippet that implements a particular functionality.
Various initiatives have concentrated on leveraging LLMs to generate code that aligns
with a natural language description of a programming task, called prompt [22, 45,
59]. The experiences of developers in using tools such as Copilot or ChatGPT show
that these tools assist them in quickly completing programming tasks [5, 31]. Other
studies show that these tools save developers’ efforts in online searching and offer a
good starting point for implementing task descriptions [4, 56].
Although LLMs provide numerous insights into completing programming tasks,
different studies show that LLM suggestions can occasionally be plagued by bugs,
non-compilable code, or misalignment with task descriptions [56]. In addition to
the model’s performance, which can affect the correctness of the output of LLMs,
another factor that can impact the correctness of outputs is the complexity or
ambiguity of the prompt [25, 62]. The evaluation of different studies shows that
sometimes the code generated by LLMs fails to meet some of the conditions
described in the prompt [10].
Various approaches have been employed to address this limitation, including few-
shot learning [6] and self-planning [24]. In the former, diverse task-representative
examples, as input/output pairs, are incorporated into the prompt. The latter
approach involves breaking down original tasks into smaller steps using LLMs and
calling them plans. While both approaches have shown improvements in the rate
of accurate solutions generated by LLMs, they also have their own limitations.
The few-shot learning approach’s drawback lies in the diversity of tasks, which
may require a diverse range of illustrative examples in the prompt. Meanwhile,
the self-planning approach’s limitation pertains to evaluating the correctness of the
generated step-by-step plan and ensuring that the newly decomposed prompt aligns
with the primary intent.
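To make the few-shot idea concrete, the sketch below assembles a prompt from input/output pairs before sending it to an LLM. The task description, the example pairs, and the prompt wording are illustrative assumptions, not the prompts used in the cited studies.

```python
# A minimal sketch of few-shot prompting for code generation. The task description,
# example pairs, and prompt wording are illustrative assumptions.
def build_few_shot_prompt(task_description, examples):
    """Prepend illustrative input/output pairs to the task description."""
    shots = "\n".join(f"# Input: {inp!r} -> Output: {out!r}" for inp, out in examples)
    return (
        "Write a Python function `solution` for the task below.\n"
        f"{shots}\n"
        f"# Task: {task_description}\n"
    )

prompt = build_few_shot_prompt(
    "Return the list of unique elements, preserving their first-occurrence order.",
    [([1, 1, 2], [1, 2]), (["a", "b", "a"], ["a", "b"])],
)
print(prompt)  # this prompt would then be sent to the LLM for completion
```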
Apart from providing incorrect suggestions, studies show that LLMs also
generate code that is complex [37], inefficient [45], and prone to reproducing vulner-
abilities and various Common Weakness Enumerations (CWEs) [41]. Consequently,
in addition to a set of test cases used to evaluate the correctness of suggestions,
various tools and methods are also necessary to filter out low-quality suggestions.
Another domain within the SE development life cycle that has been influenced by
LLMs is program repair. Automatic program repair aids developers in enhancing
code reliability. Various LLMs such as CodeT5 [23, 60], Codex [43], or ChatGPT [49]
have been employed for program repair. Their results show that directly
applying LLMs for automatic program repairs outperformed different state-of-the-
art tools across various levels of granularity: producing complete patches, inline
repairs with provided prefixes and suffixes for buggy functions, and repairing
individual lines [60].
The adoption of retrieval-based few-shot learning, which exhibits improvements in test
case generation, also yields interesting outcomes in program repair [26, 36].
Fine-tuning pre-trained LLMs with bug-repair datasets is another effort in this
area [23]. While fine-tuning shows some improvement in program repair compared
to direct LLM prompting and state-of-the-art program repair tools, it is expensive
and demands a considerable volume of data.
LLMs are also employed for repairing bugs in both programs and test cases that
have been generated by LLMs themselves [32, 46]. This is accomplished by re-
prompting the LLMs with, for example, the inclusion of error details in the prompt.
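A re-prompting loop of this kind can be sketched as follows. The `generate` callable stands in for any LLM invocation (e.g., Codex or ChatGPT) and is a hypothetical placeholder; executing untrusted generated code would of course require sandboxing in practice.

```python
# A sketch of the re-prompting loop described above: the generated code is executed,
# and any error details are appended to the prompt for the next attempt. `generate`
# is a hypothetical placeholder for an LLM call.
import traceback

def repair_by_reprompting(task_prompt, generate, max_rounds=3):
    prompt = task_prompt
    for _ in range(max_rounds):
        code = generate(prompt)
        try:
            exec(compile(code, "<llm-suggestion>", "exec"), {})  # run the suggestion
            return code                                          # no error: accept it
        except Exception:
            error = traceback.format_exc(limit=1)
            prompt = (f"{task_prompt}\n"
                      f"# The previous attempt failed with:\n# {error}\n"
                      f"# Please fix the code:\n{code}")
    return None  # give up after max_rounds attempts
```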
3 Research Method
In this section, our attention shifts toward a more comprehensive discussion about
the process of generating code using LLMs and the assessment of the quality of their
suggestions. We delve into two of our studies in this domain [10, 11] along with their
findings across three different coding-related tasks: functional implementation, test
case generation, and program (test) syntax repair. Figure 1 shows an overview of
our research method.
Fig. 1 The proposed methodology for generating and evaluating code with LLMs
attempts, we resort to the Python parser to identify the erroneous line. We retain
only the lines prior to the erroneous line, ensuring they remain free of syntax errors
(a code sketch of steps 3-5 and 7 follows this list).
4. Repair Intended behavior: It is possible that certain test cases within the unit
test exhibit unintended behavior. First, we exclude test cases with incorrect test
input, such as when the PUT expects a list of integers but the test input is a
string, or if the PUT expects two inputs but the test case invokes the PUT with
only one input. Next, we target test cases with an incorrect asserted output, i.e.,
test cases whose asserted output does not match the expected output of the PUT.
To address this unintended behavior, we draw
inspiration from Pynguin [33], an automated test generation tool for the Python
language. The approach operates by gathering the test input from the test case,
executing the PUT with the provided test input, and subsequently comparing
the expected output of the PUT (returned values) with the asserted output (test
output). If they are not equal, we repair the test case by replacing its asserted
output with the expected output.
5. Mutation Testing (MT): We use MutPy [35], a tool designed for implementing
MT. To assess the effectiveness of test cases, we calculate Mutation Score (MS),
which is the ratio of killed mutants out of all mutants of a PUT [40]. If the MS of
a unit test is below 100%, we proceed to transmit the test cases and the surviving
mutants to the “Prompt Augmentation,” step 6. Otherwise, we send the test cases
to “Oracle Minimization,” step 7.
6. Prompt Augmentation: This step focuses on improving the effectiveness of the
unit test using surviving mutants. To achieve this, we augment the initial prompt
by incorporating the unit test and one of the surviving mutants. Then, we re-
prompt LLMC using these augmented prompts. After collecting the new test
cases, we iterate through steps 3, 4, 5, and, if necessary, step 6 with another
surviving mutant until either the MS reaches 100% or all the surviving mutants
of the PUT have been incorporated into the initial prompt.
7. Oracle Minimization: The test cases generated by the LLMC often contain
redundant assertions. Additionally, the augmentation process might introduce
further redundant test cases. To avoid confusing developers, similar to previous
tools that generate mutation-driven test oracles [40], we employ a Greedy
technique to reduce the number of assertions. This final step helps eliminate
redundant test cases and their assertions that do not contribute to improving the
MS.
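The sketch below illustrates, under simplifying assumptions, the core of steps 3-5 and 7 described above: truncating a generated test file at the first syntax error, repairing a mismatching asserted output, computing the mutation score, and greedily minimizing the test suite. Mutant execution is abstracted into a `run_test` callable, and the helper names are our own; the actual studies delegate these steps to tools such as MutPy.

```python
# A minimal, self-contained sketch of steps 3-5 and 7; mutant execution is abstracted
# into a `run_test(test_case, mutant)` callable, and helper names are our own.
import ast

def truncate_at_syntax_error(code: str) -> str:
    """Step 3: keep only the prefix of the generated test file that parses cleanly."""
    lines = code.splitlines()
    while lines:
        try:
            ast.parse("\n".join(lines))
            return "\n".join(lines)
        except SyntaxError as err:
            lines = lines[: (err.lineno or 1) - 1]  # drop the erroneous line and everything after
    return ""

def repair_asserted_output(put, test_input, asserted_output):
    """Step 4: if the asserted output disagrees with what the PUT returns, replace it."""
    expected = put(*test_input)
    if expected != asserted_output:
        asserted_output = expected
    return test_input, asserted_output

def mutation_score(test_cases, mutants, run_test):
    """Step 5: MS = killed mutants / all mutants (in %); `run_test(t, m)` returns True
    if test case `t` fails on mutant `m`, i.e., the mutant is killed."""
    killed = sum(any(run_test(t, m) for t in test_cases) for m in mutants)
    return 100.0 * killed / len(mutants) if mutants else 0.0

def minimize_oracles(test_cases, mutants, run_test):
    """Step 7: greedily keep only test cases that kill at least one not-yet-killed mutant."""
    kept, killed = [], set()
    for t in test_cases:
        newly_killed = {m for m in mutants if run_test(t, m)} - killed
        if newly_killed:
            kept.append(t)
            killed |= newly_killed
    return kept
```

A unit test with a mutation score below 100% would then be routed to the prompt-augmentation step, as described above.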
Evaluation: To assess the test cases generated by our proposed method, we
utilize two different datasets. The first one is the HumanEval dataset [9],
which serves as a benchmark for evaluating LLMs. It has 164 human-written
programming problems at easy-to-medium levels. Each problem has different
attributes such as descriptions and reference solutions. We treat each task within
the HumanEval dataset as a PUT and evaluate the effectiveness of test cases
generated by LLMs in detecting injected bugs (mutants) of the tasks within this
dataset.
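For reference, the HumanEval tasks can be loaded with the Hugging Face datasets library as sketched below; the dataset id and field names reflect the public `openai_humaneval` release, and treating the reference solution as the PUT mirrors the setup described above.

```python
# A sketch of loading the HumanEval benchmark with the Hugging Face `datasets`
# library; the dataset id and field names reflect the public "openai_humaneval" release.
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")
task = humaneval[0]
print(task["entry_point"])                                  # name of the function under test
put_source = task["prompt"] + task["canonical_solution"]    # reference solution used as the PUT
# task["test"] holds the benchmark's own check function, kept separate from LLM-generated tests.
```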
Furthermore, we employ the tasks contained in the Refactory dataset, i.e., the
same dataset utilized in the first part of our study, as PUTs and generate test cases
for those tasks with our proposed method. Subsequently, we apply these test cases,
both before and after the augmentation step, to the buggy code of students within
the dataset. In addition, we apply the test cases on the buggy code that we collected
from Copilot in the first study. This dataset enables us to evaluate the effectiveness
of test cases generated by LLMs in revealing real bugs within students’ code
submissions, as well as the buggy code generated by an LLM like Copilot. We
employ Pynguin [33], a state-of-the-art automatic test case generation tool, as the
comparison baseline in this study.
In this section, we present the results of our research in addressing the questions
posed at the beginning of this chapter. Subsequently, we delve into the significant
takeaways from our findings and present key insights that can benefit practitioners
and researchers.
4.1 Results
In this section, we will share our findings from two studies aimed at addressing the
two following research questions:
RQ1: How can LLMs be employed to generate reliable code for implementing
a specific functionality?
The results of our first study in Table 2 show that Copilot’s CR in solving
the programming assignments lags behind students at 36.4% compared to 59%.
However, leveraging an SE program repair tool significantly increases the
correctness of Copilot’s suggestions from 36.4 to 98.5%, achieving a success rate of
95% in terms of repair rate. Moreover, the cost of repairing buggy code, quantified
in terms of repair time and Repair Patch Size (RPS), is lower for the buggy code
generated by Copilot in comparison to the buggy submissions of students.
Our observation also indicates that Copilot generates solutions that are
easier to understand and that employ more Pythonic programming keywords
than the student code [10]. This contrast is evident in the average Cyclomatic
Complexity (CC) across various tasks in Table 2. While both
students’ and Copilot’s CC values remain below 10, which can be attributed to
the simplicity of the tasks, the code generated by Copilot exhibits lower CC on
average compared to the students’ code.
Table 2 Copilot’s suggestions compared to students’ submissions. CR@k denotes the proportion
of accurate solutions among Copilot’s suggestions, collected from the top-k suggestions across
all five attempts
           Correctness ratio          Repairing cost                 Cyclomatic complexity
Candidate  CR@1    CR@5    CR@10      RepRate   AvgTime   AvgRPS     CC
Copilot    5.2%    21.6%   36.4%      95%       4.94      0.33       2.81
Students   –       –       59%        89%       6.48      0.35       3.87
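Cyclomatic Complexity values such as those in Table 2 can be computed with a tool like Radon [44]. The snippet below is a small sketch; the sample function is an illustrative assumption, not one of the study's tasks.

```python
# A sketch of measuring Cyclomatic Complexity with Radon; the sample source code is
# an illustrative assumption.
from radon.complexity import cc_visit

source = """
def classify(x):
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    else:
        return "positive"
"""

for block in cc_visit(source):
    print(block.name, block.complexity)   # e.g., classify 3
```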
While Copilot claims to eliminate duplicate solutions from its top ten
suggestions, our results in Fig. 2 show that duplicates persist
within such suggestions. This discrepancy stems from Copilot’s reliance on token
sequence comparison to identify and eliminate duplicates. As an illustration,
consider the last attempt, “atp5”, for q3 in Fig. 2: among 28 correct solutions
generated by Copilot, only two are unique. In contrast, 21 of the students’ 22
accurate solutions are unique. This result is also echoed by another
study, in which users report that Copilot gets into loops of suggesting the same thing [5].
The findings of the first study are based on the earliest release of Copilot (v1.7).
Although subsequent releases have introduced modifications that could influence
result replication, the methodology remains applicable to any of Copilot’s releases
and other LLMs for the purpose of evaluating the quality of their generated code.
RQ2: How can LLMs be employed for generating effective test cases?
The findings from our second study, as presented in Table 3, show that
both of the selected LLMs in our investigation—Codex and llama-2-chat—are
capable of generating test cases that exhibit higher effectiveness compared to those
produced by Pynguin. This enhanced effectiveness is observed after implementing
refining steps (Before-Augment), namely, the repair of syntax errors and the repair
of intended behavior. Additionally, augmenting prompt with surviving mutants
(After-Augment) further contributes to the enhancement of test case effectiveness.
For instance, when utilizing llama-2-chat with a few-shot initial prompt
approach, the MS improves from 85.16 to 93.57% for tasks within the HumanEval
dataset. Furthermore, the outcomes obtained from human-written buggy code in
students’ submissions and Copilot-generated buggy code validate the enhanced
effectiveness of the test cases in identifying bugs.
Fig. 2 The cumulative distribution of solutions by Copilot and students across three selected programming tasks. Correct (C), Non-duplicate Correct (NDC),
Buggy (B), and Non-duplicate Buggy (NDB). Attempts (atp) for students equal to a random sample set of their submission [10]
Table 3 Evaluation result of test cases generated by Codex and llama-2-chat 7B. The total number
of mutants in HumanEval dataset is 1260 [11]
Mutants (HumanEval)    Buggy code (Refactory)
In another scenario, we break down the task description into three separate
prompts for each function: “unique_day”, “unique_month”, and “contains_unique
_day”. These prompts were structured as follows:
– unique_day: Given a day and a list of possible birth dates, the function should
return True if there is only one potential birthday with that day and False
otherwise.
– unique_month: Given a month and a list of possible birth dates, the function
should return True if there is only one potential birthday within that month and
False otherwise.
– contains_unique_day: Given a month and a list of possible birth dates, the
function should return True if there is only one potential birthday with that
specific month and day and False otherwise.
We began with the prompt of the “unique_day” function, placed at the beginning
of the source file, and accepted Copilot’s first suggested solution. We followed
the same process for the “unique_month” and “contains_unique_day” func-
tions. In this scenario, the CR for “unique_day”, “unique_month”, and “con-
tains_unique_day” amounted to 88%, 0%, and 40%, respectively.
While this approach improved the CR for two of the prompts, the CR for
“unique_month” remained at zero. In the rest of this section, we delve into this
case and explore a practical approach to enhance the CR in this specific experiment.
Incorporating Input and Expected Output Pairs in the Prompt As Copilot
failed to generate any correct solutions for the “unique_month” prompt in the previous
scenario, we conducted a manual review of Copilot’s suggested solutions. Our
review revealed that, in all the buggy solutions, Copilot inferred an order of input items
in the function that differed from the one expected by the unit tests. An example
of such a unit test is provided below. As shown in this unit test, for each tuple
within the list of birthdays, such as (“January”, “1”), Copilot mistakenly treated
the second item as the birthday month, whereas it should have been the first item in
the tuple:
– unique_month (Month = “January”, Birthdays = [( “January”,“1” ), ( “January”,
“2” )]).
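One way to act on this observation is to embed such an input/expected-output pair directly in the prompt, so the argument order becomes unambiguous. The stub below is an illustrative rendering of such a prompt (the docstring example encodes the pair above), followed by a plausible completion; the exact wording is our assumption, not the prompt used in the study.

```python
# An illustrative prompt for "unique_month" that embeds the input/expected-output pair
# above as a docstring example, clarifying that the month is the FIRST item of each
# (month, day) tuple. The wording is an assumption for illustration.
def unique_month(month, birthdays):
    """Given a month and a list of possible birth dates, return True if there is only
    one potential birthday within that month and False otherwise.

    Example:
    >>> unique_month("January", [("January", "1"), ("January", "2")])
    False
    """
    # A plausible completion that respects the (month, day) ordering hinted at above:
    return sum(1 for m, _ in birthdays if m == month) == 1
```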
The Benefit of Dialog LLM Our results in the second study indicate that the
dialog setup of llama-2-chat provides the flexibility to assign distinct roles to each
component of the augmented prompt. For instance, by assigning the initial unit test
generated by the model to the assistant role during the prompt augmentation process,
the likelihood of repeating the initial tests in the generated output is reduced, while
the chance of generating new test cases for detecting surviving mutants is increased.
4.3 Discussion
While LLMs as AI programming assistant tools generate code snippets that are com-
petitive with human-written code across various quality metrics, they still produce
code that is buggy, non-optimal, ineffective, syntactically incorrect, or misaligned
with intended functionality [10]. Developers express that when employing Copilot
to fulfill programming tasks, they find themselves spending less time coding and more time
reviewing [5]. Another study highlights that among the various possible interactions
with LLMs, up to 34% of developers’ engagements with LLMs involve debugging,
testing, and validating suggestions [34].
The investigation of the findings of our two studies suggests that applying
diverse post-processing steps, such as employing a tool to repair buggy code
generated by LLMs or re-prompting the LLM to repair syntax errors in its
own suggestions, can enhance the reliability of the code generated by LLMs.
Furthermore, for test cases, augmenting the prompt with information like
surviving mutants can improve their effectiveness in revealing bugs.
One limitation of our two studies is the absence of a dataset from an industrial
context that includes programming task statements along with their corresponding
code. As a result, we chose to follow the approach of previous research in
software engineering by using traditional programming tasks, such as those in the
HumanEval dataset [9], to evaluate the effectiveness of LLM-based tools [38, 48,
56].
The focus of this chapter is not on the type or difficulty level of programming
tasks that LLMs can handle but rather on the quality of the code they generate
when used as AI pair programmers. We also explore methods to enhance this
code quality and strengthen developers’ confidence in tools incorporating LLMs for
code generation. Therefore, we selected datasets for which LLMs are capable
of producing answers to their programming tasks. While these tasks might not
represent all the programming tasks a professional developer encounters, they
enable us to evaluate the capabilities and limitations of LLMs. Furthermore, we
provide recommendations to developers on how to effectively utilize this tool in
actual software projects. However, readers need to acknowledge these limitations
when attempting to generalize our results to more complex tasks.
Another limitation might stem from the quality metrics used in our first study.
Although we employed a range of quantitative metrics, drawing inspiration from
previous research in software engineering, to compare Copilot’s code with that
produced by humans [15, 27, 38], the selected metrics primarily focus on functional
suitability and maintainability. However, in accordance with ISO/IEC 25010 and
relevant studies [51], there are other quality characteristics to consider, including
usability of software systems [28], performance efficiency, and security concerns,
which we have deferred to future works.
Overall, our results show that even though post-processing steps or
prompt augmentation lead to enhanced reliability and effectiveness, it remains
crucial for an expert to validate the final suggestions. A human study reveals
that while participants using LLM-based code assistance exhibit more confidence
in generating accurate code compared to those without such tools, their code is
significantly less correct upon actual evaluation [42].
Relying solely on LLM-based code assistance might pose a risk in SE
projects, particularly when novice developers put complete trust in its suggestions,
potentially overlooking buggy or suboptimal solutions due to their limited exper-
tise [10]. As emphasized by Wong et al. [59], while LLM-based code assistance
like Copilot is indeed a powerful tool for code-related tasks, it should not be flying
the plane by itself.
5 Conclusion
References
1. Ahmed, T., Devanbu, P.: Few-shot training llms for project-specific code-summarization.
In: Proceedings of the 37th IEEE/ACM International Conference on Automated Software
Engineering, pp. 1–5 (2022)
2. Akli, A., Haben, G., Habchi, S., Papadakis, M., Le Traon, Y.: Flakycat: Predicting flaky
tests categories using few-shot learning. In: 2023 IEEE/ACM International Conference on
Automation of Software Test (AST). pp. 140–151. IEEE, Piscataway (2023)
3. Alur, R., Bodik, R., Juniwal, G., Martin, M.M., Raghothaman, M., Seshia, S.A., Singh, R.,
Solar-Lezama, A., Torlak, E., Udupa, A.: Syntax-Guided Synthesis. IEEE, Piscataway (2013)
4. Barke, S., James, M.B., Polikarpova, N.: Grounded copilot: how programmers interact with
code-generating models. Proc. ACM Programm. Lang. 7(OOPSLA1), 85–111 (2023)
5. Bird, C., Ford, D., Zimmermann, T., Forsgren, N., Kalliamvakou, E., Lowdermilk, T., Gazit,
I.: Taking flight with copilot: early insights and opportunities of ai-powered pair-programming
tools. Queue 20(6), 35–57 (2022)
6. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Adv. Neural
Inf. Process. Syst. 33, 1877–1901 (2020)
7. Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., Zhang, C.: Quantifying memoriza-
tion across neural language models (2022). arXiv preprint arXiv:2202.07646
8. Chai, Y., Zhang, H., Shen, B., Gu, X.: Cross-domain deep code search with meta learning.
In: Proceedings of the 44th International Conference on Software Engineering, pp. 487–498
(2022)
9. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y.,
Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code (2021).
arXiv preprint arXiv:2107.03374
10. Dakhel, A.M., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M.C., Jiang, Z.M.J.:
Github Copilot AI pair programmer: asset or liability? J. Syst. Softw. 203, 111734 (2023)
11. Dakhel, A.M., Nikanjam, A., Majdinasab, V., Khomh, F., Desmarais, M.C.: Effective test
generation using pre-trained large language models and mutation testing (2023). https://arxiv.
org/abs/2308.16557
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional
transformers for language understanding (2018). arXiv preprint arXiv:1810.04805
13. Dinella, E., Ryan, G., Mytkowicz, T., Lahiri, S.K.: Toga: A neural method for test oracle
generation. In: Proceedings of the 44th International Conference on Software Engineering,
pp. 2130–2141 (2022)
14. Ebert, C., Cain, J., Antoniol, G., Counsell, S., Laplante, P.: Cyclomatic complexity. IEEE
Softw. 33(6), 27–29 (2016)
15. Fakhoury, S., Roy, D., Hassan, A., Arnaoudova, V.: Improving source code readability: Theory
and practice. In: 2019 IEEE/ACM 27th International Conference on Program Comprehension
(ICPC), pp. 2–12. IEEE, Piscataway (2019)
16. Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang,
D., et al.: Codebert: A pre-trained model for programming and natural languages (2020). arXiv
preprint arXiv:2002.08155
17. Green, C.: Application of theorem proving to problem solving. In: Readings in Artificial
Intelligence, pp. 202–222. Elsevier, Amsterdam (1981)
18. Gulwani, S.: Automating string processing in spreadsheets using input-output examples. ACM
Sigplan Notices 46(1), 317–330 (2011)
19. Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., Wang, H.:
Large language models for software engineering: A systematic literature review (2023). arXiv
preprint arXiv:2308.10620
20. Hu, Y., Ahmed, U.Z., Mechtaev, S., Leong, B., Roychoudhury, A.: Re-factoring based
program repair applied to programming assignments. In: 2019 34th IEEE/ACM International
Conference on Automated Software Engineering (ASE), pp. 388–398. IEEE, Piscataway
(2019)
21. Husain, H., Wu, H.H., Gazit, T., Allamanis, M., Brockschmidt, M.: Codesearchnet challenge:
Evaluating the state of semantic code search (2019). arXiv preprint arXiv:1909.09436
22. Imai, S.: Is github copilot a substitute for human pair-programming? An empirical study.
In: Proceedings of the ACM/IEEE 44th International Conference on Software Engineering:
Companion Proceedings, pp. 319–321 (2022)
23. Jiang, N., Liu, K., Lutellier, T., Tan, L.: Impact of code language models on automated program
repair (2023). arXiv preprint arXiv:2302.05020
24. Jiang, X., Dong, Y., Wang, L., Shang, Q., Li, G.: Self-planning code generation with large
language model (2023). arXiv preprint arXiv:2303.06689
25. Jones, E., Steinhardt, J.: Capturing failures of large language models via human cognitive
biases. Adv. Neural Inf. Process. Syst. 35, 11785–11799 (2022)
26. Joshi, H., Sanchez, J.C., Gulwani, S., Le, V., Verbruggen, G., Radiček, I.: Repair is nearly
generation: Multilingual program repair with llms. In: Proceedings of the AAAI Conference
on Artificial Intelligence, vol. 37, pp. 5131–5140 (2023)
27. Kim, S., Whitehead Jr, E.J.: How long did it take to fix bugs? In: Proceedings of the 2006
International Workshop on Mining Software Repositories, pp. 173–174 (2006)
28. Komiyama, T., Fukuzumi, S., Azuma, M., Washizaki, H., Tsuda, N.: Usability of software–
intensive systems from developers’ point of view: Current status and future perspectives of
international standardization of usability evaluation. In: Human-Computer Interaction. Design
and User Experience: Thematic Area, HCI 2020, Held as Part of the 22nd International
Conference, HCII 2020, Copenhagen, Denmark, July 19–24, 2020, Proceedings, Part I 22,
pp. 450–463. Springer, Berlin (2020)
29. Lemieux, C., Inala, J.P., Lahiri, S.K., Sen, S.: Codamosa: Escaping coverage plateaus in test
generation with pre-trained large language models. In: International Conference on Software
Engineering (ICSE) (2023)
30. Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J.,
Gimeno, F., Dal Lago, A., et al.: Competition-level code generation with alphacode. Science
378(6624), 1092–1097 (2022)
31. Liang, J.T., Yang, C., Myers, B.A.: Understanding the usability of AI programming assistants
(2023). arXiv preprint arXiv:2303.17125
32. Liventsev, V., Grishina, A., Härmä, A., Moonen, L.: Fully autonomous programming with large
language models (2023). arXiv preprint arXiv:2304.10423
33. Lukasczyk, S., Fraser, G.: Pynguin: Automated unit test generation for python. In: Proceedings
of the ACM/IEEE 44th International Conference on Software Engineering: Companion
Proceedings, pp. 168–172 (2022)
34. Mozannar, H., Bansal, G., Fourney, A., Horvitz, E.: Reading between the lines: Modeling user
behavior and costs in ai-assisted programming (2022). arXiv preprint arXiv:2210.14306
35. Mutpy: A mutation testing tool for Python 3.x source code. https://github.com/mutpy/mutpy
(2019)
36. Nashid, N., Sintaha, M., Mesbah, A.: Retrieval-based prompt selection for code-related few-
shot learning. In: Proceedings of the 45th International Conference on Software Engineering
(ICSE’23) (2023)
37. Nguyen, N., Nadi, S.: An empirical evaluation of github copilot’s code suggestions. In:
Proceedings of the 19th International Conference on Mining Software Repositories, pp. 1–5
(2022)
38. Nguyen, N., Nadi, S.: An empirical evaluation of GitHub Copilot’s code suggestions. In:
Accepted for Publication Proceedings of the 19th ACM International Conference on Mining
Software Repositories (MSR), pp. 1–5 (2022)
39. Pan, C., Lu, M., Xu, B.: An empirical study on software defect prediction using codebert
model. Appl. Sci. 11(11), 4793 (2021)
40. Papadakis, M., Kintis, M., Zhang, J., Jia, Y., Le Traon, Y., Harman, M.: Mutation testing
advances: An analysis and survey. In: Advances in Computers, vol. 112, pp. 275–378. Elsevier,
Amsterdam (2019)
41. Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., Karri, R.: Asleep at the keyboard? assessing
the security of github copilot’s code contributions. In: 2022 IEEE Symposium on Security and
Privacy (SP), pp. 754–768. IEEE, Piscataway (2022)
42. Perry, N., Srivastava, M., Kumar, D., Boneh, D.: Do users write more insecure code with ai
assistants? (2022). arXiv preprint arXiv:2211.03622
43. Prenner, J.A., Babii, H., Robbes, R.: Can openai’s codex fix bugs? An evaluation on quixbugs.
In: Proceedings of the Third International Workshop on Automated Program Repair, pp. 69–75
(2022)
44. Radon: A python tool to compute various metrics from the source code. https://radon.
readthedocs.io/en/latest (2019)
45. Sakib, F.A., Khan, S.H., Karim, A.: Extending the frontier of chatgpt: Code generation and
debugging (2023). arXiv preprint arXiv:2307.08260
46. Schäfer, M., Nadi, S., Eghbali, A., Tip, F.: Adaptive test generation using a large language
model (2023). arXiv preprint arXiv:2302.06527
47. Siddiq, M.L., Santos, J., Tanvir, R.H., Ulfat, N., Rifat, F.A., Lopes, V.C.: Exploring the
effectiveness of large language models in generating unit tests (2023). arXiv preprint
arXiv:2305.00418
48. Sobania, D., Briesch, M., Rothlauf, F.: Choose your programming copilot: A comparison of
the program synthesis performance of github copilot and genetic programming (2021). arXiv
preprint arXiv:2111.07875
49. Sobania, D., Briesch, M., Hanna, C., Petke, J.: An analysis of the automatic bug fixing
performance of chatgpt (2023). arXiv preprint arXiv:2301.08653
50. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra,
S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models
(2023). arXiv preprint arXiv:2307.09288
51. Tsuda, N., Washizaki, H., Honda, K., Nakai, H., Fukazawa, Y., Azuma, M., Komiyama, T.,
Nakano, T., Suzuki, H., Morita, S., et al.: Wsqf: Comprehensive software quality evaluation
framework and benchmark based on square. In: 2019 IEEE/ACM 41st International Confer-
ence on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 312–321.
IEEE, Piscataway (2019)
52. Tufano, M., Drain, D., Svyatkovskiy, A., Deng, S.K., Sundaresan, N.: Unit test case generation
with transformers and focal context (2020). arXiv preprint arXiv:2009.05617
53. Tufano, M., Deng, S.K., Sundaresan, N., Svyatkovskiy, A.: Methods2test: A dataset of focal
methods mapped to test cases. In: Proceedings of the 19th International Conference on Mining
Software Repositories, pp. 299–303 (2022)
54. Tufano, M., Drain, D., Svyatkovskiy, A., Sundaresan, N.: Generating accurate assert statements
for unit test cases using pretrained transformers. In: Proceedings of the 3rd ACM/IEEE
International Conference on Automation of Software Test, pp. 54–64 (2022)
55. Turzo, A.K., Faysal, F., Poddar, O., Sarker, J., Iqbal, A., Bosu, A.: Towards automated classi-
fication of code review feedback to support analytics (2023). arXiv preprint arXiv:2307.03852
56. Vaithilingam, P., Zhang, T., Glassman, E.L.: Expectation vs. experience: Evaluating the
usability of code generation tools powered by large language models. In: Chi Conference on
Human Factors in Computing Systems Extended Abstracts, pp. 1–7 (2022)
57. Wang, Y., Wang, W., Joty, S., Hoi, S.C.: Codet5: Identifier-aware unified pre-trained
encoder-decoder models for code understanding and generation (2021). arXiv preprint
arXiv:2109.00859
58. Wang, C., Yang, Y., Gao, C., Peng, Y., Zhang, H., Lyu, M.R.: No more fine-tuning? an
experimental evaluation of prompt tuning in code intelligence. In: Proceedings of the 30th
ACM Joint European Software Engineering Conference and Symposium on the Foundations
of Software Engineering, pp. 382–394 (2022)
59. Wong, D., Kothig, A., Lam, P.: Exploring the verifiability of code generated by github copilot
(2022). arXiv preprint arXiv:2209.01766
60. Xia, C.S., Wei, Y., Zhang, L.: Automated program repair in the era of large pre-trained language
models. In: Proceedings of the 45th International Conference on Software Engineering (ICSE
2023). Association for Computing Machinery, New York (2023)
61. Yu, T., Gu, X., Shen, B.: Code question answering via task-adaptive sequence-to-sequence pre-
training. In: 2022 29th Asia-Pacific Software Engineering Conference (APSEC), pp. 229–238.
IEEE, Piscataway (2022)
62. Zamfirescu-Pereira, J., Wong, R.Y., Hartmann, B., Yang, Q.: Why johnny can’t prompt:
how non-AI experts try (and fail) to design llm prompts. In: Proceedings of the 2023 CHI
Conference on Human Factors in Computing Systems, pp. 1–21 (2023)
63. Zheng, Z., Ning, K., Chen, J., Wang, Y., Chen, W., Guo, L., Wang, W.: Towards an
understanding of large language models in software engineering tasks (2023). arXiv preprint
arXiv:2308.11396
64. Zhou, X., Han, D., Lo, D.: Assessing generalizability of codebert. In: 2021 IEEE International
Conference on Software Maintenance and Evolution (ICSME), pp. 425–436. IEEE, Piscataway
(2021)
65. Ziegler, A., Kalliamvakou, E., Li, X.A., Rice, A., Rifkin, D., Simister, S., Sittampalam, G.,
Aftandilian, E.: Productivity assessment of neural code completion. In: Proceedings of the 6th
ACM SIGPLAN International Symposium on Machine Programming, pp. 21–29 (2022)
BERTVRepair: On the Adoption of
CodeBERT for Automated Vulnerability
Code Repair
Nguyen Ngoc Hai Dang, Tho Quan Thanh, and Anh Nguyen-Duc
1 Introduction
2 Background
This section provides basic knowledge about Transformer architecture and BERT
model (Sect. 2.1), a description of CodeBERT and UnixCoder (Sect. 2.2), and recent
research on automated code repair (Sect. 2.3).
This section presents fundamental DL architectures that provide the foundation for
our approach, which are the Transformer architecture and BERT model.
Transformer
Although the RNN model combined with attention is widely applied to Seq2Seq
problems and has achieved quite good results, this architecture still has some major
drawbacks:
– An RNN node only receives semantic information from previous nodes (words
on the left) and lacks information from nodes located after (words on the right).
Although the bidirectional RNN variant [15] can somewhat solve this problem,
it increases the computation time while the information is still transmitted
indirectly.
– The model computes sequentially: the result of the current step depends
on the result of the previous step. Therefore, computations for different time
steps cannot be performed in parallel, so the GPU is not fully utilized.
– It can “forget” or “lose” data when information is transmitted through many
network nodes during the calculation due to too large or too small derivatives
(vanishing/exploding gradient [16]). This hinders the embedding of context
information of words in the sentence for the word representation vector. Although
some variants of RNN such as GRU [17] and LSTM [18] have the ability to
selectively decide to keep/discard information at each network node as presented
in the previous chapter, a significant amount of information is still lost when
passing through many intermediate nodes.
To overcome these drawbacks, in 2017, the research team of Google Brain
introduced the Transformer model [19]. This model is gradually being widely used
in the field of natural language processing—NLP [19]—and even Image Process-
ing (Computer Vision, CV) [20]. Like RNN networks, Transformer is designed
to handle sequential data, such as tasks like machine translation or automatic
summarization in the field of NLP. However, unlike RNN, Transformer does not
require sequential data to be processed in order. For example, if the input data is a
natural language sentence, Transformer does not need to process the beginning of
the sentence before the end of the sentence. Because of this feature, Transformer
allows many parallel calculations and therefore reduces training time. In particular,
the Transformer’s operating mechanism allows information at each network node
to be aggregated from both directions simultaneously in a natural way, thereby
significantly improving performance on machine learning problems.
The Transformer model also follows the seq2seq architecture, but with Encoder and
Decoder built from scaled dot-product attention and feed-forward blocks as shown
in Fig. 1. Besides the modules commonly found in neural network architectures,
Transformer introduces three new techniques:
BERT
Since the advent of the Transformer model, many large language models have
been pre-trained on Transformer. Typical examples are BERT [21] using the
encoder architecture of Transformer, GPT-2 [22] using the decoder architecture of
Transformer, Transformer XL [23] using the recurrent decoder architecture based on
the decoder of Transformer. Applying these large language models to downstream
applications has yielded impressive results in many NLP problems. In 2018,
Devlin and colleagues proposed the Pre-training of Deep Bidirectional Transformers
for Language Understanding (BERT) model [21]. BERT is trained on a large amount
of data and can be seen as a pre-trained model that is often fine-tuned with some
simple neural layers to create state-of-the-art models that solve various problems
in the field of natural language processing such as question answering [24] or
language inference [25]. BERT uses the same architecture as the encoder block of
Transformer and is trained on two objectives: masked language modeling (MLM)
and next sentence prediction (NSP). For the MLM objective, BERT will randomly
mask 15% of the tokens in the input sequence (with 80% of them actually being
replaced by the token [MASK], 10% of the tokens being randomly replaced by
another token, and 10% remaining unchanged), and then BERT is trained to predict
the masked tokens based on the context of the surrounding tokens. Therefore, the
output of BERT can be used as an embedding layer that has been embedded with
contextual information from both directions, thanks to the self-attention mechanism
in each Transformer encoder layer. Figure 2 illustrates how BERT predicts and
trains for this objective. After the input data passes through the Encoding layers
of Transformer, we will get the final embedding vectors. These embedding vectors
continue to go through an FC layer to create the probability distribution vectors of
the vocabulary at the corresponding position in the sentence. This is also BERT’s
prediction for the probability of occurrence of each word at the output position of
the model. BERT then compares the predicted probability distribution at each
[MASK] position with the corresponding word in the ground truth
to compute the loss for this training step and update the corresponding weights.
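The MLM behavior described above can be observed directly with the Hugging Face transformers library, as in the minimal sketch below; the checkpoint name and the example sentence are illustrative choices.

```python
# A minimal sketch of BERT's masked-language-modeling behavior using the Hugging Face
# `transformers` pipeline; the checkpoint name and example sentence are illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The patient should take her [MASK] before breakfast.")[:3]:
    print(candidate["token_str"], round(candidate["score"], 3))
# BERT scores vocabulary items for the [MASK] position using context from both sides.
```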
Fig. 2 Overview of the prediction and training process of the BERT model
Fig. 3 Thanks to the MLM objective and the encoder architecture of Transformer, BERT can be
used as an embedding block to extract contextual information from both sides
With the two training objectives, BERT offers the following advantages:
– The MLM training strategy helps BERT efficiently represent the contextual
semantics of the training corpus, which are encoded as embedding vectors. Once
applied to another corpus, such as one of programming languages, this strategy
can help embed the corresponding semantics of that corpus.
– The NSP training strategy allows BERT to train on another supervised task, which
takes advantage of the semantic encoding offered by MLM. By controlling the
meaning of the [CLS] token, one can leverage a BERT-based model for other
supervised tasks.
2.2 CodeBERT
The success of BERT in NLP has led to the CodeBERT model for programming and natural languages [26]. This model borrows the design and training procedure of RoBERTa [27] and applies them in a multilingual setting. CodeBERT is trained on two types of data: code segments and their documentation. The training dataset is a mixture of programming languages, with no indication in the model to differentiate between these languages.
Code and documentation are tokenized with the same process as in a standard text-processing pipeline. Special tokens are then inserted into the generated token sequences to form the input sequence for the model, and the model must output (1) dense vector representations of both code tokens and word tokens and (2) the representation of the [CLS] token, as stated in [26]. To generate such representations, the model is trained on two learning objectives: masked language modeling and replaced token detection.
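A minimal sketch of obtaining these representations with the publicly released checkpoint through the Hugging Face transformers library is shown below; the documentation and code strings are illustrative:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

doc = "return the maximum of two integers"
code = "int max(int a, int b) { return a > b ? a : b; }"

# Passing documentation and code as a pair lets the tokenizer insert the
# special tokens that separate the two segments.
inputs = tokenizer(doc, code, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_vectors = outputs.last_hidden_state        # one 768-d vector per token
cls_vector = outputs.last_hidden_state[:, 0, :]  # vector at the [CLS] position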
– The second pre-training task is conditional text generation, in which the model learns to generate the corresponding comment from the flattened AST of the code segment.
Methods used for creating code patches can be classified into template-based and
generative-based.
sequence. The graphs of edit sequences are then split into connected components, and DBSCAN is used to cluster these components into edit groups. For each pair of edit sequences in an edit group, a template is created by identifying the longest overlapping edit sub-sequence and the context of this sub-sequence. The editing context is also obtained from GumTree, which specifies the locations of the edit operations in the code segments.
In the repair phase, appropriate templates are found for each unseen bad code segment based on its similarity with the known bad code in the learning-phase dataset. The chosen template is then further refined to match the bad code. Templates are selected by comparing the input code with the edit groups' templates mined in the learning process. The transformative operations specified in the templates' edit patterns are then applied to the input code to create code patches, and only patches that do not contain redundant code are kept.
Source code embedding is an NLP concept concerning the transformation of source code into numerical representations, making it understandable and processable by ML algorithms. Training a model on a huge corpus yields dense vector representations of language: a general-purpose representation that preserves semantic relationships between words to the extent that these relationships can be expressed through arithmetic operations [30]. However, learning such representations requires (1) an enormous amount of data and (2) vast computing resources; for these reasons, such pre-trained models are normally provided by tech giants like Google with their BERT architecture, and their use in different tasks has been pushing performance boundaries.
the sequence-to-sequence framework and use transfer learning from the code-repair task to the vulnerability-repair task. Due to the commonalities in their design, we go through them using a three-phase structure: pre-processing, pre-training and then fine-tuning for transfer learning, and patch generation.
SeqTrans
SeqTrans [31] uses transfer learning: a model is first trained on the bug-repair task and then fine-tuned on the vulnerability-repair task; for that reason, each task needs a separate dataset. In the reported experiments, the model is trained on the bug-repair task using the Tufano [32] dataset, and the Ponta [33] dataset is then used in the fine-tuning phase, with both datasets consisting of Java source code. Before going into the details of this method, let us look at the general design of the architecture in Fig. 7.
Tokenization and Normalization Although SeqTrans does not use a tree representation of the source code, the GumTree algorithm is still used to map the AST nodes of the source and patch so that the diff context can be extracted with the commercial tool Understand. Each sample from both the bug-repair and vulnerability-repair datasets is represented as a list of code segment pairs.
These code pairs are further refined to construct def-use chains (assignments of some value to a variable), containing all variable definitions relevant to the vulnerable statement [31]. Figure 8 shows a sample code-pair input for the model, in which all global variable definitions and statements that have dependencies on the vulnerable statements are preserved, while other statements in the same method are discarded.
After the code-pair dataset has been created, each code segment is first normalized to reduce the vocabulary size of the dictionary, which also determines the size of the output vectors, as they denote the probability of each token in the dictionary being the prediction; this, in turn, eases the model's training process. Each numeric literal is turned into num1, ..., numn, each string into str1, ..., strn, and each variable name is replaced with var1, ..., varn, as shown in Fig. 9; these placeholders are later replaced with their real values using mappings generated during the normalization process. At this stage, the input is ready to be tokenized with Byte Pair Encoding [34] along with the dictionary of the dataset.
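A toy sketch of this normalization step is shown below; it is regex-based and purely illustrative (the actual SeqTrans pipeline operates on parsed Java code and also renames variables, which is omitted here):

import re

def normalize(code):
    """Replace string and numeric literals with numbered placeholders and
    keep a mapping so the placeholders can be substituted back after the
    model has produced its prediction."""
    mapping = {}

    def repl(pattern, prefix, text):
        counter = 0
        def sub(match):
            nonlocal counter
            counter += 1
            key = f"{prefix}{counter}"
            mapping[key] = match.group(0)
            return key
        return re.sub(pattern, sub, text)

    code = repl(r'"[^"]*"', "str", code)   # string literals -> str1, str2, ...
    code = repl(r"\b\d+\b", "num", code)   # numeric literals -> num1, num2, ...
    return code, mapping

normalized, mapping = normalize('send(buffer, 1024, "utf-8");')
# normalized == 'send(buffer, num1, str1);'
# mapping    == {'str1': '"utf-8"', 'num1': '1024'}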
Pre-training and Fine-Tuning SeqTrans is designed using Transformer modules as building blocks, whose details have been discussed in the previous sections. The architecture is the same for both the pre-training and fine-tuning phases, with the only differences being the dataset and the number of training steps: the pre-training model is trained with a batch size of 4096 for 300k steps, and the fine-tuning model is trained with the same batch size for an extra 30k steps [31].
VRepair
VRepair's ability to generate fixes across multiple lines within a single code block provides robustness beyond the existing models [29, 36, 37].
(Figure: the two-phase pipeline: in the first phase, the preprocessed buggy code and patch are tokenized and each dictionary word is mapped to a vector in an embedding lookup table built with CodeBERT/UniXcoder; in the second phase, this lookup table feeds the Seq2Seq model.)
We treat the source code along with its patch as plain text, as shown in Fig. 12, and tokenize them using a byte-pair encoding algorithm. Before tokenization, an extra preprocessing step adds special tokens to the original dataset, both in the input sequence and in the target sequence. This step is done before we extract the embeddings from the data using programming language models.
– In the input sequence shown in Fig. 13, <StartLoc> and <EndLoc> are added around the location identified as vulnerable, and there is also an additional indicator "CWE-xxx," which specifies the type of vulnerability.
– For the target sequence shown in Fig. 14, we use two new unique tokens, <ModStart> and <ModEnd>, and change the target sequence to contain only the modifications needed. There are three types of modification that can be made to the input to create a patch, shown in Fig. 15, leading to three formats of the target sequence, each indicating the type of modification made to the input sequence.
Fig. 12 An example of buggy code and its patch. (a) An example of buggy code. (b) An example of a patch
(Figure: an example target sequence, "<ModStart> index < len && index >= 0 <ModStart> else { return -1 <ModEnd> ; } }", annotated with the three context tokens, the tokens inserted after "index < len", and the token that replaces "index".)
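To make the sequence formats concrete, the sketch below builds an input and a target sequence for an insertion-type modification; the helper functions, indices, and token lists are illustrative assumptions, and the exact layout used for replacements and deletions differs:

def build_input_sequence(code_tokens, vuln_start, vuln_end, cwe_id):
    """Prepend the CWE type and wrap the tokens flagged as vulnerable
    in <StartLoc> / <EndLoc>."""
    tokens = (
        [f"CWE-{cwe_id}"]
        + code_tokens[:vuln_start]
        + ["<StartLoc>"]
        + code_tokens[vuln_start:vuln_end]
        + ["<EndLoc>"]
        + code_tokens[vuln_end:]
    )
    return " ".join(tokens)

def build_target_sequence(context_tokens, new_tokens):
    """Encode only the modification: <ModStart>, a few context tokens
    anchoring the edit, the tokens to introduce, then <ModEnd>."""
    return " ".join(["<ModStart>"] + context_tokens + new_tokens + ["<ModEnd>"])

src = build_input_sequence(
    ["if", "(", "index", "<", "len", ")", "{", "...", "}"],
    vuln_start=2, vuln_end=5, cwe_id="787",
)
tgt = build_target_sequence(["index", "<", "len"], ["&&", "index", ">=", "0"])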
The embeddings extracted from language models such as CodeBERT are stored as a lookup table over the vocabulary of the corpus; therefore, we first have to extract the vocabulary of words that exist in the corpus, which should include the newly added special tokens. Each word in the vocabulary is represented by a vector of size 768, as shown in Fig. 11, which encodes the semantics learned by the pre-training tasks of the programming language models.
In this process, we use two different large language models to extract embeddings from: one is CodeBERT [26], and the other is UniXcoder [28]; both are trained on a large dataset of programming languages, including the ones in our vulnerability dataset. The main difference between these two pre-trained models is the type of input on which they were trained: CodeBERT is trained on both natural language and programming language, while UniXcoder is trained on programming language only. This difference stems from the fact that they aim at optimizing performance on different tasks: the former emphasizes code summarization, while the latter is better at autoregressive tasks such as code completion.
The extracted embeddings, representing the entire vocabulary of the corpus, are stored as a lookup table, which is later used as input during the training of the downstream translation model for vulnerability repair. One thing to note here is that the programming language models used in our experiments have their own tokenizers and dictionaries. Therefore, input tokenized by OpenNMT's tokenizer might be tokenized further inside these language models, which leads the output tensors to have shape n × 768, where n is the number of tokens created from the input token. For example, the token word1 might be further tokenized into subword1 and subword2, making the output tensor of the language model have size 2 × 768. To create an embedding of size 1 × 768 representing a single token in our dictionary, we use two methods to aggregate the language models' output tensors. First, we take the mean of the output tensor along the second (token) dimension.
The code for this method is sketched below, in which the tokenizer is the one used by the programming language model to map the vocabulary of the vulnerability dataset into indexes in the dictionary of the programming language model. These indexes are then fed into the language model to get the output tensor, whose mean is taken along the second dimension.
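A minimal sketch of this first method, assuming the Hugging Face transformers API and the publicly released microsoft/codebert-base checkpoint (the example words are illustrative):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed_mean(vocab_word):
    """Method 1: the dictionary word may be split into several subwords by
    the language model's tokenizer; average their output vectors along the
    token dimension to obtain a single 1 x 768 embedding."""
    inputs = tokenizer(vocab_word, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape: (1, n, 768)
    return hidden.mean(dim=1)                        # shape: (1, 768)

embedding_table = {w: embed_mean(w) for w in ["memset", "index", "<ModStart>"]}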
Second, whenever we feed a token from our dictionary into the language models, we prepend the special [CLS] token, which intuitively represents the semantic information of the entire input. With this method, we only need the first row of the language model's output to act as the embedding. The code for this method resembles the first one, with the only difference being that the vector at the [CLS] position is used as the embedding.
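A corresponding sketch under the same assumptions as above (in RoBERTa-style models the tokenizer's leading <s> token plays the role of [CLS]):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed_cls(vocab_word):
    """Method 2: keep the default special tokens and use the output vector
    at the first position as the embedding of the whole dictionary word."""
    inputs = tokenizer(vocab_word, return_tensors="pt")  # adds <s> ... </s>
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # shape: (1, n + 2, 768)
    return hidden[:, 0, :]                               # shape: (1, 768)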
4 Experiments
4.1 Dataset
We used the existing datasets provided by VRepair, Big-Vul [39] and CVEfixes [40], for training the neural translation network [35]. The Big-Vul dataset was created by crawling CVE databases and extracting vulnerability-related information such as the CWE ID and CVE ID. Big-Vul contains 3754 different vulnerabilities across 348 projects, categorized into 91 different CWE IDs, with a time frame spanning from 2002 to 2019. The CVEfixes dataset was collected in a way similar to Big-Vul. It contains 5365 vulnerabilities across 1754 projects, categorized into 180 different CWE IDs, with a time frame spanning from 1999 to 2021. In our research, we conducted the experiments only on Big-Vul, to narrow down the scope of this study, and left experiments on a more diverse dataset for future work.
To train and validate our experiments, we split the dataset into training, validation, and test data, with 70% for training, 10% for validation, and 20% for testing. For the Big-Vul dataset, this gives 2228 training samples, 318 validation samples, and 636 test samples.
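A simple way to produce such a random split is sketched below (illustrative only; the concrete split may have been produced differently):

import random

def split_dataset(samples, train_frac=0.7, valid_frac=0.1, seed=0):
    """Shuffle and split into training (70%), validation (10%), and test (20%) sets."""
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)
    n_train = int(len(samples) * train_frac)
    n_valid = int(len(samples) * valid_frac)
    return (
        samples[:n_train],
        samples[n_train:n_train + n_valid],
        samples[n_train + n_valid:],
    )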
The OpenNMT-py framework [41] reports two measures during training and validation: perplexity (PPL) and accuracy. PPL measures how uncertain the network is that the predicted outcome is correct: low PPL means low uncertainty, while high PPL means high uncertainty. Luong et al. report that translation quality is connected to the PPL [42], claiming that a model with low PPL produces translations of higher quality. The PPL is defined in Eq. 1, in which the network's uncertainty about the generated document D is measured by the joint probability of all words in that document, normalized by the number of words N_d in the document:
$$\text{Perplexity}(D) = \exp\left(-\frac{\sum_{d=1}^{|D|} \log P(w_d)}{\sum_{d=1}^{|D|} N_d}\right) \qquad (1)$$
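Written out in code, Eq. (1) amounts to the following (the per-word log-probabilities are illustrative values):

import math

def perplexity(word_logprobs, n_words):
    """Eq. (1): exponentiate the negated sum of the word log-probabilities,
    normalized by the number of words in the generated document."""
    return math.exp(-sum(word_logprobs) / n_words)

print(perplexity([-1.2, -0.4, -2.3, -0.9], n_words=4))  # approximately 3.32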
The accuracy compares the predicted output sequence Ŷ with the target sequence Y at the token level. The way accuracy is calculated makes it a metric that does not give much insight into the results or the model's performance: even if all the tokens present in the target sequence Y are also present in the predicted sequence Ŷ, the position of each token can differ from the target sequence, and the accuracy will still be 100%.
The BLEU score [43] is a metric used specifically for evaluating text quality in machine translation; it is based on the precision of n-length subsequences (n-grams) of the prediction. Precision, in the context of machine translation, is the proportion of predicted tokens that also appear in the target sequence. The BLEU score is calculated with Eq. 3, in which N is the maximum length of the subsequences considered in the predicted sequence:
$$\text{BLEU} = e^{\frac{1}{N}\sum_{n=1}^{N} P_n} \qquad (3)$$
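In practice, BLEU is usually computed with an existing implementation rather than from the formula directly; a minimal example with NLTK's sentence_bleu, on illustrative token lists, is shown below:

from nltk.translate.bleu_score import sentence_bleu

reference = ["<ModStart>", "index", "<", "len", "&&", "index", ">=", "0", "<ModEnd>"]
candidate = ["<ModStart>", "index", "<", "len", "&&", "index", ">", "0", "<ModEnd>"]

# sentence_bleu takes a list of reference token lists and one candidate.
score = sentence_bleu([reference], candidate)
print(round(score, 3))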
4.4 Environment
All of our experiments are conducted on a machine with 32 GB of RAM and one NVIDIA Quadro RTX 6000 with 24 GB of GDDR6 memory. To train, predict, and create the vocabularies for both the embedding extraction and our translation model, we use the OpenNMT-py framework [41], a neural machine translation framework built on top of PyTorch [44]. The programming language models used in our experiments are all accessible through the Hugging Face hub and implemented with the Transformers framework [45].
5 Results
We use the same hyperparameters as in the first phase, in which the learning rate is set to 0.0005, the hidden size is 768, and the sequence length is 2000. The reason is that we want to isolate the effect of using embeddings on the downstream model, and keeping the same hyperparameters helps us correctly attribute any improvement in performance to the use of embeddings. As mentioned in the earlier section, we also chose to reduce the number of training iterations from 100,000 in the previous experiments to 20,000, with the same justification.
As mentioned in Sect. 3.2, we fed each word in the vocabulary into the programming language model to get the representation of the word; however, each of these models has its own input specification, leading to differences in the output. The experiments denoted with the postfix (1) use the embeddings extracted with the first method, i.e., aggregation along the second dimension of the language model's output tensor, and those with the postfix (2) use the [CLS] token as the embedding.
Similar to the first phase of our experiment, results in our second phase are also
reported on token-level accuracy, perplexity, and training time. In addition, we also
report the models’ performance on their capability to generate the perfect patches
that entirely match the samples’ labels. The results show that the use of embeddings
extracted from CodeBERT by the latter method does help improve the performance
slightly.
One example of a perfect patch generated by the models, on a sample from the validation dataset in the format described in Sect. 3.1, is a predicted sequence indicating that the generated patch inserts memset between stride ) ; and ( input , at every place in the original code that matches this pattern.
Results from both of these experiments (Tables 1 and 3) show that using pre-trained embeddings yields only marginal improvement over training the models from scratch, in terms of both the models' performance and the training time. We argue that the code-repair and vulnerability-repair tasks are not similar enough for the embeddings to act as a medium for transferring information that improves the training of the vulnerability-repair model. The experiments in our second phase that use embeddings show only a slight increase in BLEU score and exact match compared with the vanilla pipeline, although a BLEU score reaching 30, as in Table 3, is considered understandable according to [46]. The high perplexity in the results of both phases shows that the models are not certain in their predictions, meaning that the probability of the correctly predicted token is not much larger than that of the other tokens.
6 Conclusions
Although a vulnerability is also a type of bug, it is more difficult to detect and patch, because exploiting an application through security errors takes more time. As a consequence, the available labeled datasets of vulnerable source code are sparse compared with those for generic bugs, which limits the application of deep networks to the problem of vulnerability repair. The literature review has shown that most of the notable recent research on code repair or vulnerability repair focuses on learning patterns in the dataset from a natural language perspective, in which the input is represented either as lists of tokens or as an abstract syntax tree.
We attempted to improve the performance of a state-of-the-art model, VRepair, in which a Transformer-based model is trained to generate vulnerability code patches. We proposed BERTVRepair, a model that leverages embeddings extracted from CodeBERT and UniXcoder as a medium for transferring knowledge learned from a larger dataset to the vulnerability-repair task. However, the code embeddings do not offer a significant improvement on the task, and while conducting this research, we also found that other researchers have conducted similar experiments [49] on the task of vulnerability detection and reached the same conclusion as our experiments.
Our work can be used as evidence that the vulnerability-repair task and code-understanding tasks are not closely related. However, due to the complex nature of vulnerabilities, one could try to reduce this gap by focusing on the type of vulnerability most likely to resemble a code-understanding task. Another approach worth considering is feature engineering on the vulnerability dataset using traditional machine learning methods, the justification being that the complex nature of vulnerabilities can be further explored through code representations such as data flow, as suggested in [50].
References
1. Nguyen Duc, A., Chirumamilla, A.: Identifying security risks of digital transformation - an
engineering perspective. In: Pappas, I.O., et al. (eds.) Digital Transformation for a Sustainable
Society in the 21st Century, Cham, pp. 677–688 (2019)
2. Aalvik, H., et al.: Establishing a security champion in agile software teams: a systematic
literature review. In: Arai, K. (ed.) Advances in Information and Communication, Cham,
pp. 796–810 (2023)
3. McGraw, G., Potter, B.: Software security testing. IEEE Secur. Privacy 2(5), 81–85 (2004).
Number: 5
4. Le, T.H.M., Chen, H., Babar, M.A.: A survey on data-driven software vulnerability assessment and prioritization. ACM Comput. Surv. 55(5), 100:1–100:39 (2022)
5. Wysopal, C.: Art of Software Security Testing, The: Identifying Software Security Flaws:
Identifying Software Security Flaws, 1st edn., 298 pp. Addison-Wesley Professional, Upper
Saddle River (2006)
6. Goues, C.L., Pradel, M., Roychoudhury, A.: Automated program repair. Commun. ACM
62(12), 56–65 (2019)
7. Li, Y., Wang, S., Nguyen, T.N.: DEAR: a novel deep learning-based approach for automated
program repair. In: Proceedings of the 44th International Conference on Software Engineering,
New York, July 5, pp. 511–523 (2022)
8. Fu, M., et al.: VulRepair: a t5-based automated software vulnerability repair. In: Proceedings
of the 30th ACM Joint European Software Engineering Conference and Symposium on the
Foundations of Software Engineering, New York, November 9, pp. 935–947 (2022)
9. Saha, R.K., et al.: Elixir: effective object-oriented program repair. In: 2017 32nd IEEE/ACM
International Conference on Automated Software Engineering (ASE), pp. 648–659 (2017)
10. Tian, H., et al.: Evaluating representation learning of code changes for predicting patch
correctness in program repair. In: Proceedings of the 35th IEEE/ACM International Conference
on Automated Software Engineering, New York, Jan. 27, pp. 981–992 (2021)
11. Zhang, S., et al.: Deep learning based recommender system: a survey and new perspectives.
ACM Comput. Surv. 52(1), 1– 38 (2020)
12. Vasic, M., et al.: Neural program repair by jointly learning to localize and repair, Apr. 2 (2019). https://doi.org/10.48550/arXiv.1904.01720. arXiv: 1904.01720 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1904.01720 (visited on 05/31/2023)
13. Schramm, L.: Improving performance of automatic program repair using learned heuristics.
In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, New
York, Aug. 21, pp. 1071–1073 (2017)
14. Mashhadi, E., Hemmati, H.: Applying CodeBERT for automated program repair of java simple
bugs. In: 2021 IEEE/ACM 18th International Conference on Mining Software Repositories
(MSR), pp. 505–509 (2021)
15. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal
Proces. 45(11), 2673–2681 (1997)
16. Hochreiter, S.: The vanishing gradient problem during learning recurrent neural nets and
problem solutions. Int. J. Uncertainty Fuzziness Knowledge Based Syst. 6(02), 107–116 (1998)
17. Cho, K., et al.: Learning phrase representations using RNN encoder–decoder for statistical
machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), Doha, pp. 1724–1734 (2014)
18. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
19. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Proces. Syst. 30, 5998–6008
(2017)
20. Liu, Y., et al.: A survey of visual transformers. IEEE Trans. Neural Networks Learn. Syst.,
1–21 (2023)
21. Devlin, J., et al.: BERT: pre-training of deep bidirectional transformers for language under-
standing (2018). arXiv preprint. arXiv:1810.04805
22. Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8),
9 (2019)
23. Dai, Z., et al.: Transformer-XL: attentive language models beyond a fixed-length context
(2019). arXiv preprint. arXiv:1901.02860
24. Dwivedi, S.K., Singh, V.: Research and reviews in question answering system. Proc. Technol.
10, 417–424 (2013)
25. MacCartney, B.: Natural Language Inference. Stanford University, Stanford (2009)
26. Feng, Z., et al.: CodeBERT: a pre-trained model for programming and natural languages
(2020). arXiv preprint. arXiv:2002.08155
27. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach, Jul. 26, 2019. https://doi.org/10.48550/arXiv.1907.11692. arXiv: 1907.11692 [cs]. [Online]. Available: http://arxiv.org/abs/1907.11692 (visited on 11/19/2023)
28. Guo, D., et al.: UniXcoder: unified cross-modal pre-training for code representation (2022).
arXiv preprint. arXiv:2203.03850
29. Ma, S., et al.: VuRLE: automatic vulnerability detection and repair by learning from examples.
In: Foley, S.N., Gollmann, D., Snekkenes, E. (eds.) Computer Security – ESORICS 2017,
vol. 10493, pp. 229–246 (2017)
30. Church, K.W.: Word2vec. Nat. Lang. Eng. 23(1), 155–162 (2017)
31. Chi, J., et al.: SeqTrans: automatic vulnerability fix via sequence to sequence learning. IEEE
Trans. Software Eng. 49, 564–585 (2020)
32. Tufano, M., et al.: An empirical investigation into learning bug-fixing patches in the wild via
neural machine translation. In: 2018 33rd IEEE/ACM International Conference on Automated
Software Engineering (ASE), New York, pp. 832–837 (2018)
33. Ponta, S.E., et al.: A manually-curated dataset of fixes to vulnerabilities of open-source soft-
ware. In: Proceedings of the 16th International Conference on Mining Software Repositories,
Montreal, Quebec, pp. 383–387 (2019)
34. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword
units. In: Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), Berlin, pp. 1715–1725 (2016)
35. Chen, Z., Kommrusch, S., Monperrus, M.: Neural transfer learning for repairing security
vulnerabilities in C code. arXiv:2104.08308 [cs] (2022)
36. Chen, Z., Kommrusch, S., Monperrus, M.: Neural transfer learning for repairing security
vulnerabilities in C code. IEEE Trans. Software Eng. 49(1), 147–165 (2023)
37. Guo, J., et al.: A deep look into neural ranking models for information retrieval. Inf. Process.
Manage. 57(6), 102067 (2020)
38. Chen, Z., Kommrusch, S., Tufano, M., Pouchet, L.-N., Poshyvanyk, D., Monperrus, M.:
SequenceR: sequence-to-sequence learning for end-to-end program repair. IEEE Trans. Soft-
ware Eng. 47(9), 1943–1959 (2021). https://doi.org/10.1109/TSE.2019.2940179
39. Fan, J., et al.: A C/C++ code vulnerability dataset with code changes and CVE summaries. In: Proceedings of the 17th International Conference on Mining Software Repositories, Seoul, pp. 508–512 (2020)
40. Bhandari, G., Naseer, A., Moonen, L.: CVEfixes: automated collection of vulnerabilities and
their fixes from open-source software. In: Proceedings of the 17th International Conference on
Predictive Models and Data Analytics in Software Engineering (2021)
41. Klein, G., et al.: OpenNMT: open-source toolkit for neural machine translation (2017). arXiv
preprint. arXiv:1701.02810
42. Luong, T., et al.: Addressing the rare word problem in neural machine translation. In:
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and
the 7th International Joint Conference on Natural Language Processing, Beijing, pp. 11–19
(2015)
43. Blagec, K., et al.: A global analysis of metrics used for measuring performance in natural
language processing (2022). arXiv: 2204.11574
44. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library (2019). arXiv: 1912.01703 [cs.LG]
45. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing: System
Demonstrations, pp. 38–45 (2020)
46. Evaluating models. https://cloud.google.com/translate/automl/docs/evaluate. Accessed 06 June
2023
47. Chen, Z., et al.: Sequencer: Sequence-to-sequence learning for end-to-end program repair.
IEEE Trans. Software Eng. 47(09), 1943–1959 (2021)
48. Li, Z., et al.: SySeVR: a framework for using deep learning to detect software vulnerabilities.
IEEE Trans. Dependable Secure Comput. 19, 1–1 (2021)
49. Choi, Y., et al.: Learning sequential and structural information for source code summarization.
In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2842–
2851 (2021)
50. Guo, D., et al.: GraphCodeBERT: pre-training code representations with data flow (2020).
arXiv preprint. arXiv:2009.08366
ChatGPT as a Full-Stack Web Developer
Väinö Liukko , Anna Knappe , Tatu Anttila, Jyri Hakala, Juulia Ketola,
Daniel Lahtinen, Timo Poranen , Topi-Matti Ritala, Manu Setälä,
Heikki Hämäläinen, and Pekka Abrahamsson
Abstract The arrival of ChatGPT has also generated significant interest in the
field of software engineering. Little is empirically known about the capabilities of
ChatGPT to actually implement a complete system rather than a few code snippets.
This chapter reports firsthand experiences from a graduate-level student project in which a real-life software platform for the financial sector was implemented from scratch, using ChatGPT for all possible software engineering tasks. The resulting
code was reviewed by a seasoned software engineering professional. The static
code analysis was performed by using commercial software. The main conclusions
drawn are as follows: (1) these findings demonstrate the potential for ChatGPT to
be integrated into the software engineering workflow; (2) it can be used for creating
a base for new components and for dividing coding tasks into smaller pieces; (3)
noticeable enhancements in GPT-4, compared to GPT-3.5, indicate superior working
memory and the ability to continue incomplete responses, thereby leading to more
coherent and less repetitive dialogues; and (4) ChatGPT produced code that did not
include any major errors but requires efficient prompting to be effective.
1 Introduction
3 Research Design
Solita Ltd. [27] is a large software consultancy in the Nordic countries. Solita collaborates with universities by devising exercise topics and supervising student exercises. As part of a university project work course, it challenged the student team to undertake an AI-assisted, large-scale project.
The project topic was chosen from the well-defined public procurement requests on the Hilma portal [11], a Web site for procurement in the Finnish public sector. It was agreed in the project that the specifications would not be directly used as input material for the AI; the prompts given to the AI were written mostly by the team themselves. However, details such as the number of fields in the user interface were kept the same as in the original request.
The selected project, Valvontatyöpöytä (VTP), is a platform for financial supervision, designed to support the operations of an organization. The intended user group for the VTP is financial professionals, including supervisors, managers, and analysts.
The VTP project [34] was proposed at the end of December 2022 and accepted by a seven-member team. The project started at the end of January 2023. The team consisted of three master's-level and four bachelor's-level students of computer science or information technology.
students of computer science or information technology. None of the team members
The team’s process of working with ChatGPT is illustrated in Fig. 2. When a task
was related to existing code, the assigned team member would provide the relevant
code to ChatGPT and request it to generate a solution. If the task was not related to
existing code, the team member would first ask ChatGPT for recommendations
before requesting it to produce the code. Once ChatGPT generated the code, the
team member would review it for correctness. If the code was deemed satisfactory,
the team member would add it to the code base, save the chat session as a Markdown
[10] file, and create a pull request with the chat as an attachment. If the code was not
acceptable, the team member would provide the problematic code back to ChatGPT
and repeat the process, iterating until a satisfactory solution was achieved or asking
for a new recommendation for another approach.
In the team’s experience with the described workflow, ChatGPT’s efficacy varied
depending on the nature of the task at hand. While the process appears quite linear
and systematic when documented in this manner, it often felt less straightforward
in practice, involving considerable trial and error. This experience suggests that
a more experienced team might have navigated these challenges more efficiently,
potentially avoiding some of the iterative cycles encountered. Challenges arose,
for instance, when attempting to convey concepts that were inherently visual or
spatial. In such scenarios, the limitations of a text-based interface became evident.
Ideas that might be effortlessly communicated through diagrams or images proved
cumbersome and, at times, inefficient to describe solely with words. Similarly, tasks
that necessitated only minor code adjustments could have been more efficiently
addressed manually, rather than submitting the code back to ChatGPT and detailing
the required modifications.
The project team logged their weekly working hours according to different cate-
gories (Documentation, Requirements, Design, Implementation, Testing, Meetings,
Studying, Other, Lectures). In the end, the project had a total of 793 hours logged as
shown in Table 2. Throughout the project, the project team met with the customer
twice a week to ensure quality assurance and planning were on track. This is the
reason why meetings (265 hours, 33%) have such a considerable part in the logged
hours. The implementation took 195 hours (25%) and studying 131 hours (17%).
The logged time also included university course-related subjects, such as lectures
and other studying, which did not relate directly to the implementation of the
project.
Table 2 Project's working hours (h) and activities from sprints 0 to QA during the time period 23.1.2023–30.4.2023

          S0      S1      S2      S3      S4      S5      QA     Total
Doc.     13.5     0.0     2.0     7.0     5.8    10.8     8.0    47.1 (6%)
Req.      8.0     2.0     1.0     1.0     2.0     0.0     0.0    14.0 (2%)
Des.     16.0     9.7     7.0     7.5     1.0     0.5     4.0    45.7 (6%)
Impl.     0.0    22.2    38.5    37.0    33.6    64.4     0.0   195.7 (25%)
Test.     0.0     2.8    10.5     3.5     5.2     8.2     0.5    30.7 (4%)
Meet.    39.0    36.2    35.0    44.5    43.2    57.2    10.2   265.3 (33%)
Stud.    18.5    35.0    17.2    29.5    16.0    15.0     0.0   131.2 (17%)
Other     1.0     2.2     0.5     4.4     7.0     3.1     0.1    18.3 (2%)
Lect.     8.5     4.5     2.0     2.2    24.2     4.0     0.0    45.4 (6%)
Total   104.5   114.6   113.7   136.6   138.0   163.2    22.8   793.4 (100%)
3.5 Documentation
We proceeded to build the correct routing and to integrate the frontend and
backend together. Recognizing the need for adjustments to the database schema,
we initiated revisions while concurrently addressing application styling and adding
several popup forms.
4 Results
4.2 Code
5 Evaluation
The backend directory structure was very flat. Dividing files into different direc-
tories would have made it easier for new developers to understand the codebase.
Otherwise, file naming was good and clear.
The backend was implemented as a Node.js Express application, which is a
standard choice. The application initialization was done well and clearly followed
standard practices. There was some inconsistency in route definition, but this was
not harmful from a maintenance perspective.
In the software, all functionality took place as part of REST handlers. This is a
good option for very small applications, but in an application of this scale, a layered
architecture with services called in handlers would be a significantly better option,
especially from the perspective of not duplicating functionality in handlers.
No input validation was implemented in the handlers, which can be considered
a significant deficiency. This allowed, among other things, the entry of arbitrary
values from the domain perspective, such as negative week numbers and the like.
In terms of database usage, pagination had not been implemented, which is
very likely to cause problems for system operation as data volumes grow. Database
queries were simple and were parameterized according to good practice. Transac-
tions were not implemented, but there was little need for them in the functionality.
Poor implementation of query execution was exemplified by the /api/drafts/:id/full
implementation, where three separate queries were executed unnecessarily, even
though a single query could have achieved the same result. Since queries were not
made within a transaction, this could potentially lead to inconsistent responses.
Error Handling
Error handling in handlers was implemented so that calls within async functions
were wrapped in try-catch structures. The standard practice would be to implement
a separate error-handling wrapper in which handler functions would be wrapped.
Doing this would make the code clearer to read and avoid situations where an error
is not caught, and the response object’s send function is not called due to oversight.
Maintainability
The user interface was implemented using the React UI library, a common and safe choice. The interface was only partially localized: some components had hard-coded Finnish text that should be translated.
The file and directory structure was better organized than that of the backend
system. However, there was some inconsistency as calls to the backend system were
found both in the components and in a separate Service directory. State management
between components was handled with callbacks, and there was no single source of
truth.
No tests had been implemented for the user interface, leading to buggy function-
alities. For example, application forms didn’t support error handling or validation,
and the system couldn’t distinguish between the lack of a response from the backend
and missing information.
GPT-4 could produce test templates for components, so this deficiency was not
mainly caused by the AI tool used.
Based on manual testing, the appearance was satisfactory from a B2B perspec-
tive. The display did not scale correctly for mobile devices, but using such a system
on a small mobile device would be atypical.
Regarding styles, both individual CSS files corresponding to each component and
the styled-components library were used. A more consistent implementation would
have been preferable. However, such inconsistency is not uncommon in human-
implemented systems. The source code included a style guide, but it was not
followed.
Many functions had been implemented without connecting to the backend
system. However, in a mock sense, the user interface was suitable for preliminarily
demonstrating how it would work and could serve as a basis for further development.
The quality of the software was also examined using static analysis tools. Static
analysis refers to analyzing software without executing the program. For this
purpose, we used the SonarQube product [28], which is widely used in software
production. Its users include IBM, Microsoft, and NASA.
The analysis from SonarQube provided observations on reliability (bugs), main-
tainability (code smells), security (security hotspots), code coverage, and code
duplications. The observations for bugs and maintainability are graded as either
blocker, critical, major, minor, or info. Security hotspot observations are categorized
into high, medium, and low levels.
Regarding bugs, the analysis identified two. A major-level bug was related to a user interface component named "Error." While "Error" is not a reserved word per se, it shadows the built-in Error object and should not be used as an identifier.
A minor-level observation was related to a style guide that had an HTML table
without a table description. This omission is an accessibility issue. Since this
observation is related to development guidelines, it is not considered as serious as if
the observation were in the actual application user interface code.
There were 43 observations related to maintainability: 31 of major level and 12 of minor level. Twelve of the major observations concerned redundant variables in the code, i.e., variables that are assigned a value without being used. Eleven major observations related to iterations where the array index is used as the React component's key value. The key value should be stable, which an array index is not. In the following example, instead of the index, some unique identifier of the criterion should be used as the key value.
<TableBody>
{criteria.map((criterion, index) => (
<TableRow key={index}>
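{/* Anti-pattern flagged by the analysis: the array index is not a stable key; a unique identifier of the criterion would be preferable */}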
...
<TableCell>{criterion}</TableCell>
</TableRow>
))}
</TableBody>
Other major observations included three where optional chaining in JavaScript was not used, i.e., accessing object properties with the question mark notation.
The evaluation found that the backend, implemented in Node.js Express, exhibited
a flat directory structure that could hinder new developers but was otherwise well-
organized. While the backend was more feature-rich compared to the frontend, it
lacked critical aspects such as access control and input validation. Test coverage was
generally good at 75%, although some areas were not tested. Tests were found to be
fragile due to direct database dependency. Error handling and route definitions also
had room for improvement, while the system was found to be lacking in layered
architecture and transaction management. The frontend, built with React, had a
better-organized directory structure but exhibited inconsistencies in backend calls
and state management. No UI tests were implemented, leading to bugs and a lack of
error handling. The interface was partially localized, and although it did not scale
well for mobile, it was deemed satisfactory for B2B use. Inconsistencies in styling
were noted, but these are not uncommon in human-coded systems.
Both the frontend and backend could benefit from further refinement, particularly in validation, error handling, and test implementation. However, when tasked with specific improvements such as input validation, GPT-4 was able to perform adequately. The most serious error was found in the React implementation and could only be caught by an experienced developer.
Static analysis of the software was performed using SonarQube, a widely
recognized tool in the industry. The analysis focused on multiple dimensions
including reliability (bugs), maintainability (code smells), and security (security
hotspots), as well as code coverage and duplications. Importantly, the analysis
did not find any vulnerabilities. Two bugs were identified. The maintainability
observations were more numerous, with 43 in total: 31 were major and 12 were
minor. Many of these related to redundant variables and unstable key values in
React component iteration. Overall, the static analysis highlighted several areas for
improvement in code quality and maintainability.
6 Discussion
Our empirical data suggest that the effectiveness of using ChatGPT in developing a
software system is significantly influenced by the design of the prompts. A majority
of the observed limitations appear to emanate from a lack of context or insuffi-
ciently specific prompts, echoing findings from previous literature highlighting the
importance of prompt engineering in natural language processing tasks [17, 36].
Monteiro et al. [21] found similarly that developers with strong expertise in specific
technologies and frameworks were more effective in guiding ChatGPT for tasks like
bug fixing.
The ability of ChatGPT to generate useful and accurate outputs, such as
validation logic for routers, underscores its potential utility. However, this raises
the question of how much expertise is required in crafting the prompts. More
experienced developers might elicit more sophisticated and contextually appropriate
responses from ChatGPT, a key factor for companies and researchers contemplating
the integration of ChatGPT into their development pipelines.
Our study also highlights several practical implications. The integration of Chat-
GPT into the software development workflow revealed key benefits and challenges,
offering valuable insights for practitioners. Initially, ChatGPT clearly expedited the
initial development stages, reducing the time required for initial project setup and
swiftly generating foundational code structures for our React-based project. This
acceleration was particularly notable in the creation of boilerplate code and standard
features, such as setting up RESTful API routes using Node.js.
However, ChatGPT’s limitations became apparent in more complex tasks. One
notable issue was its inconsistent handling of complex database interactions, often
requiring manual adjustments, particularly with intricate SQL queries and database
schema designs. This highlighted the need for human oversight in such areas.
In terms of documentation, ChatGPT’s assistance proved to be a double-
edged sword. While it efficiently populated many sections of the reports, the
generated content sometimes lacked the nuanced understanding necessary for
complex technical topics. This is in line with the findings of others on AI-assisted documentation.
7 Conclusions
References
1. Ahmad, A., Waseem, M., Liang, P., Fehmideh, M., Aktar, M.S., Mikkonen, T.: Towards human-
bot collaborative software architecting with ChatGPT (2023). arXiv preprint arXiv:2302.14600
2. Bourne, S.: Bourne shell. https://en.wikipedia.org/wiki/Bourne_shell. Accessed 6 Nov 2023
3. Chai - a BDD / TDD assertion library. https://www.chaijs.com/. Accessed 6 Nov 2023
4. ChatGPT. https://chat.openai.com/ (2023). Accessed 6 April 2023
5. cloc - Count lines of Code. https://github.com/AlDanial/cloc (2023). Accessed 13 April 2023
6. CSS - Cascading Style Sheets. https://www.w3.org/TR/CSS/#css. Accessed 6 Nov 2023
7. Dockerfile - Docker instruction file. https://docs.docker.com/. Accessed 6 Nov 2023
8. Dong, Y., Jiang, X., Jin, Z., Li, G.: Self-collaboration code generation via ChatGPT (2023).
arXiv preprint arXiv:2304.07590
9. Express - a back end web application framework. https://expressjs.com/. Accessed 6 Nov 2023
10. Gruber, J.: Markdown - markup language. https://daringfireball.net/projects/markdown/.
Accessed 6 Nov 2023
11. Hilma - Public procurement. https://www.hankintailmoitukset.fi/en/ (2023). Accessed 31
March 2023
12. Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., Wang, H.:
Large language models for software engineering: A systematic literature review (2023). arXiv
preprint arXiv:2308.10620
13. HTML - HyperText Markup Language. https://html.spec.whatwg.org/. Accessed 6 Nov 2023
14. JavaScript - programming language. https://en.wikipedia.org/wiki/JavaScript. Accessed 6 Nov
2023
15. JSON (JavaScript Object Notation) - a lightweight data-interchange format. https://www.json.
org/json-en.html. Accessed 6 Nov 2023
16. Kashefi, A., Mukerji, T.: ChatGPT for programming numerical methods (2023). arXiv preprint
arXiv:2303.12093
17. Liu, Y., Du, H., Niyato, D., Kang, J., Cui, S., Shen, X., Zhang, P.: Optimizing mobile-edge
AI-generated everything (AIGX) services by prompt engineering: Fundamental, framework,
and case study (2023). arXiv preprint arXiv:2309.01065
18. Louridas, P.: Static code analysis. IEEE Softw. 23(4), 58–61 (2006)
19. MIT Licence - a permissive free software license. https://en.wikipedia.org/wiki/MIT_License.
Accessed 6 Nov 2023
20. Mocha - a JavaScript test framework. https://mochajs.org/. Accessed 6 Nov 2023
21. Monteiro, M., Branco, B.C., Silvestre, S., Avelino, G., Valente, M.T.: End-to-end software
construction using chatgpt: An experience report (2023). arXiv preprint arXiv:2310.14843
22. MySQL - an open-source relational database management system. https://www.mysql.com/.
Accessed 6 Nov 2023
23. Node.js - a cross-platform, open-source server environment. https://nodejs.org/en. Accessed 6
Nov 2023
24. OpenAI: GPT-4 Technical Report (2023). arXiv preprint arXiv:2303.08774
25. React - an open-source front-end JavaScript library. https://react.dev/. Accessed 6 Nov 2023
26. Rosqvist, T., Koskela, M., Harju, H.: Software quality evaluation based on expert judgement.
Softw. Qual. J. 11, 39–55 (2003)
27. Solita Company. https://www.solita.fi/en/company/ (2023). Accessed 29 Sept 2023
28. SonarQube. https://www.sonarsource.com/products/sonarqube/ (2023). Accessed 29 Sept
2023
29. SQL - Structured Query Language. https://en.wikipedia.org/wiki/SQL. Accessed 6 Nov 2023
30. Sten, H., Ahtee, T., Poranen, T.: Evaluation of students’ capstone software development
projects. In: SEFI Annual Conference, pp. 531–540 (2018)
31. Surameery, N.M.S., Shakor, M.Y.: Use Chat GPT to solve programming bugs. Int. J. Inf.
Technol. Comput. Eng. 3(01), 17–22 (2023). ISSN: 2455-5290
32. SVG - Scalable Vector Graphics. https://www.w3.org/Graphics/SVG/. Accessed 6 Nov 2023
33. Treude, C.: Navigating complexity in software engineering: A prototype for comparing GPT-n
solutions (2023). arXiv preprint arXiv:2301.12169
34. VTP - Source code repository for the Valvontatyöpöytä. https://github.com/AI-Makes-IT/VTP
(2023). Accessed 29 Sept 2023
35. Waseem, M., Das, T., Ahmad, A., Fehmideh, M., Liang, P., Mikkonen, T.: Using chatgpt
throughout the software development life cycle by novice developers (2023). arXiv preprint
arXiv:2310.13648
36. White, J., Hays, S., Fu, Q., Spencer-Smith, J., Schmidt, D.C.: ChatGPT prompt patterns for
improving code quality, refactoring, requirements elicitation, and software design (2023).
arXiv preprint arXiv:2303.07839
37. YAML - a human-readable data serialization language. https://yaml.org/. Accessed 6 Nov 2023
Part IV
Generative AI in Software Engineering Processes
Transforming Software Development with Generative AI: Empirical Insights on Collaboration and Workflow
1 Introduction
as software, data, and artificial intelligence [5], forcing a shift in the use of
strategic frameworks [30], and new ways of developing technology for highly
skilled employees with intelligent technology [36].
Technology for assisting developers with writing code, particularly through Integrated Development Environments (IDEs), is not a new concept [17, 26]. Tasks such as code autocompletion, test generation, and various others have been of particular interest, since software code is hypothesized to share the statistical properties of natural language [16]. Subsequently, this hypothesis has led to a large body of research on artificial intelligence (AI) for software engineering [32, 33]. With the introduction of generative artificial intelligence (GenAI), a type of AI, both software development processes and tooling have started to change fast. Further, GenAI can revolutionize software development by automating repetitive tasks, improving code quality, enhancing collaboration, providing data-driven insights, and ultimately accelerating the development life cycle [25, 31]. There is growing research into how to use Copilot [28] or generative AI systems such as ChatGPT [37] and their capability to automate software engineering tasks [20]. However, good tooling is not enough.
In order for a company to succeed with software product development, well-
working teams and good processes are key. Software engineering is a social activity
that is focused on close cooperation and collaboration between all team members
[21] and across teams in the organization [2]. Therefore, it is important to note
that while AI has great potential, it also comes with challenges [1] such as ethical
considerations, data privacy concerns, the need for skilled professionals to handle
the technology within software teams, and a potential change in the team dynamics.
However, research on team dynamics is lacking. To understand the effects on team dynamics, we also need to consider individual work practices. This chapter explores how software engineers' work practices are transformed and the potential impact of this transformation on collaboration.
We have interviewed 13 data scientists, managers, developers, designers, and
front-end developers to investigate how they use GenAI technology and how their
workday has changed. Finally, we discuss how this technology might affect software
development teamwork.
2 Related Work
and perception. Value creation is about whether or not the developers feel they are
creating something, and the factor has six sub-factors. The second factor, efficient
use of time, has two sub-factors, meeting expectations and the ability to work
focused. In essence, the assessment of a workday being good or bad is largely
influenced by the expectations for the day. For example, if one anticipates a day
filled with meetings, the day can be considered good even if most of the time is
spent in meetings. However, if one hopes for a day of focused work and the day is
filled with meetings, the day is perceived as a bad workday. Coworker interruptions
were specifically described as negatively influencing developers’ ability to focus or
work as planned, although being able to help a coworker was generally considered
positive and rewarding. Lastly, perception is about how they perceive their own
productivity.
Developer satisfaction and work productivity are related; therefore, they need
to be key considerations for software companies [15]. More productive developers
may be more satisfied, and more satisfied developers may be more productive.
Autonomy, being able to complete tasks, and technical skills all affect productivity.
By introducing new technology like GenAI, a team member’s productivity may
be positively affected. At the same time, work culture and team collaboration are
important for job satisfaction. An increased reliance on tools like GenAI may
enhance individual productivity while inadvertently reducing inter-team interac-
tions, ultimately affecting long-term job satisfaction and collective productivity.
Introducing GenAI in software teams is therefore a balancing act.
This study was conducted in the context of two research programs on software development processes, in which several companies introduced generative artificial intelligence (GenAI) into their product development process. GenAI, especially tools built on LLMs, is a new phenomenon that has not been previously studied. Due to the uncertain nature of the phenomenon, we chose an exploratory multi-case study [38]. We selected our informants using snowball sampling [3] in Slack, asking for subjects who used GenAI for a wide range of activities. As for the tools used, our studies found that ChatGPT and GitHub Copilot were the most common for code and text, while some reported using Midjourney and DALL-E 2 for image creation.
organizations are using it based on their own preferences, applying GenAI tools to a wide range of activities; see Table 2 for an overview of such activities.
The type of GenAI activities and utilization depends on individual preferences,
the task to be solved, and the user’s role in the organization. Developers typically
use GenAI when working with the source code, while managers use it for, e.g.,
organizing workshops or creating content for PowerPoint presentations.
When a person was stuck on a particular problem or did not know how to proceed,
ChatGPT was used as an assistant or fellow team member, where interacting with
it using chat could help a person solve complex problems or get increased progress.
For non-technical problems, this could be a case of writer’s block, formulations, or
when they are zoning out: "For me, the main thing is to get unstuck, whether I am
struggling with writer’s block or formulations, just by interacting with ChatGPT
and getting an immediate response is something else."
This highlights that there is an effect of just having the chat window open and
getting feedback without interrupting others. Formulating the problem to ChatGPT
made it easier to keep focus on the task, helped on the thought processes, and
helped see the problem in a new light. Developers referred to such interaction with
ChatGPT as rubber-ducking. The idea of rubber-ducking is to explain the problem
one seeks to solve to an inanimate object (e.g., a rubber duck), in an attempt to
achieve a deeper understanding of the problem and a potential solution through the
process of explaining it to someone (or something) using natural language.
Using GenAI as a sparring partner was both faster than asking human colleagues
and also took away the feeling of disturbing them in their work. Further, being able
to formulate the question as you would to a human felt easier than the alternate
Google search, where you need to consider the specific keywords and what results
they can give you.
4.2 Learning
One informant used ChatGPT to create a virtual environment for a product while developing software for a trading platform: they asked ChatGPT to simulate a stock exchange, provided it with information about which stocks could be traded at the exchange, and asked it to simulate different trading scenarios.
This provided a novel way for the informant to understand the intricacies of a
stock exchange. Further, this means that ChatGPT had the context for the particular
trading platform the informants were interested in.
And then I said, “now I’m going to make an application out of this in such-and-such language.” And so it has the context for everything while I kept asking it further questions. This integrated approach to understanding a domain thus also produced the relevant context that ChatGPT could use to generate relevant code.
4.4 Copywriting
Perhaps not surprisingly, ChatGPT was used to assist in copywriting, which was especially useful when it was integrated into the tool the informant was using and gave live feedback; this was particularly valuable for persons who were not native or fluent in English or Norwegian. However, some found that GenAI was not very useful for writing emails and other text, for two main reasons: there was a significant overhead in engaging sufficiently with ChatGPT to create emails, and the quality did not improve.
“I think I have asked it to write emails, but for me it is just faster to write it myself. The formulation was better though.”
Getting started with a relatively novel coding project in any company requires quite a lot of boilerplate code; this type of code does not add functionality relevant to the business case but is required to get the project up and running with the necessary declarations and structures. Both back-end and front-end developers used ChatGPT to create tailored boilerplate code:
If I have a task, to create a list of tricks with something, and thumbs up and thumbs down on each element, for instance. I often start by describing what I want to ChatGPT. Then, it writes the code for me.
GenAI was also used for repetitive or tedious non-technical tasks. For example, managers and architects stated that they used ChatGPT to consolidate text for producing bids for customers. One manager explained, A bid I would normally have spent a lot of time in writing, I only spent 20 minutes on. Previously I would have spent a lot of time, looking for previous bids, adapting it and merging it. It is terrible to say it out loud [laughing], as this kind of is in some way reducing the need for my work. This type of work is what the company pays me to do.
Another example is building up technical specifications and technical architectures, where the style of the text is quite consistent but the content varies between use cases, as well as getting feedback on emails and getting a head start on writing the text.
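To make the idea of tailored boilerplate concrete, the sketch below shows the kind of scaffold a developer might ask ChatGPT to produce for the “list with thumbs up and thumbs down” example above. All names and the choice of Python are hypothetical illustrations, not code from the study:

```python
from dataclasses import dataclass, field


@dataclass
class VotableItem:
    """One element in the list, with thumbs-up/down counters."""
    title: str
    up_votes: int = 0
    down_votes: int = 0


@dataclass
class VotableList:
    """Boilerplate container: holds items and exposes voting operations."""
    items: list[VotableItem] = field(default_factory=list)

    def add(self, title: str) -> VotableItem:
        item = VotableItem(title)
        self.items.append(item)
        return item

    def vote(self, index: int, up: bool = True) -> None:
        item = self.items[index]
        if up:
            item.up_votes += 1
        else:
            item.down_votes += 1


if __name__ == "__main__":
    tricks = VotableList()
    tricks.add("Kickflip")
    tricks.add("Ollie")
    tricks.vote(0, up=True)
    tricks.vote(1, up=False)
    print([(t.title, t.up_votes, t.down_votes) for t in tricks.items])
```

The value of such generated boilerplate is not its sophistication but that the developer can skip the mechanical setup and move straight to the business-relevant logic.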
Working with code was the most common GenAI activity among software developers. GenAI was used for many different tasks, including refactoring and simplifying code, code review, translating code from one programming language to another, and simply explaining code. Testing was also well suited for GenAI utilization; given its repetitive nature, GenAI was used to create numerous tests for the code. The informants also
noted that the generated tests sometimes accounted for scenarios and test cases that
they themselves had not thought of.
It was mostly a matter of thinking up all the things that could go wrong and
creating unit tests for them. And that’s where CoPilot was brilliant, as it came up
with things that could go wrong that I had never thought of.
In the previous section, we described how GenAI is used for a variety of activities.
In the studied companies, GenAI was becoming an integrated part of their daily
work, and most explained that they used GenAI daily or “all the time.”
Among the study participants, we found two styles of interaction: simple
dialogue and advanced dialogue extended with prompt engineering. The interaction
style depended on the work context and types of problems to be solved. Table 3
contains an overview of the effects and drawbacks from interacting with GenAI.
5.1 Effects
By spending less time on manual and repetitive tasks, the improved productivity
brought more enjoyment, motivation, and fun to the work. The repetitive tasks were
seen as menial, and not particularly mentally challenging. Further, as time was
freed up, more time could be spent on creative and challenging tasks. Moreover,
engaging with the GenAI itself was experienced as fun and increased the motivation
to experiment with different applications of the new technology.
Interacting through dialogue with ChatGPT increased engagement. It was experienced as a more “natural” engagement than searching for answers to problems on Google. Having a dialogue with ChatGPT was also described as faster than concocting the necessary string of Google search keywords. Moreover, ChatGPT responded immediately with the (assumed) correct answer to the question, while googling often required additional steps: vetting the correct site on the search page, entering the particular page, and analyzing the Web page for the potential answer
to the question. One informant was so conscious about speed that they deliberately
chose GPT v3.5 over v4 in certain cases (at the time of our data collection, v3.5 was
faster than v4), where the precision and quality of the v3.5 answer was assumed to
be sufficient.
GenAI’s utility also extended beyond quick and precise answers, with informants
reporting a freedom in interacting with an artificial tool rather than having to deal
with the social considerations involved in asking a team member.
Yeah, so you don’t need to be too polite either. You don’t have to have the correct
phrasing or anything. You can just throw something out, I feel. Then you can get an
answer, and if it’s not quite right, you can refine the question again.
The threshold of asking ChatGPT was significantly lower than asking another
person or in a Slack channel. This threshold for asking colleagues could potentially
be high in a busy work environment, as one does not wish to interrupt an already-
busy colleague. Further, you get feedback immediately, while it might take a while
to get feedback on Slack.
5.2 Challenges
While there are many benefits of using GenAI, there are also challenges. Unclear rules around data confidentiality, data policy, and sensitivity meant that everyone was acutely conscious about which data to input into the chat interface. This
made the work process somewhat awkward, requiring cleansing and anonymization
of the text being sent into ChatGPT. Developers in companies using open-source
technology were less lenient in protecting code than those in companies with
internal code repositories. Moreover, the general lack of tool integration meant
that there was a substantial amount of copy-paste to move text and code between
different windows. One developer using Copilot X reported that the integration in
the IDE meant that the code could be autocompleted and explained by ChatGPT in
a seamless process, which reduced their work immensely: I think it would have been
easier to adopt a GenAI tool if I had used something like Copilot. Because then it
would have been, in a way, integrated into the workflow.
With regard to the output, all informants noted that the content produced by
GenAI, regardless of tool, seldom represented a final product and typically required
further refinement to be applicable in a real-world context. The general attitude from
the interviewees was that they expected the output to be wrong.
Regarding technological development, which is characterized by a rapid pace
and an increasing number of available libraries and technology, the cutoff date for
ChatGPT’s training data in September 2021 represented a significant drawback,
where the error rate was annoyingly high.
One architect creating project startup documents experienced that ChatGPT was
culturally biased toward how more hierarchical companies would perform activities
in a project. This meant that ChatGPT had to be prompted with specific information
regarding the methodology and project practices:
You kind of have to trick it into the right context if it’s (GenAI) going to be part
of agile processes.
Several interviewees talked about how the quality of their prompt affected the quality of the response and how prompt engineering techniques, like contextualizing the problem or using personas, could guide and steer the dialogue with ChatGPT. Prompt engineering was applied to all the activities in Table 2. One informant explained asking ChatGPT to create a description of the most critical code reviewer in the world. They then told ChatGPT to act according to this description while reviewing the code in a pull request. Another, more technical, example was telling ChatGPT to act like an SQL database to test queries. The effect of prompt engineering was seen as a matter of precision and quality, reducing the time spent on modifying the output. One explained the usage of prompt
engineering as follows:
It’s like putting up fences on the bowling lane and then narrowing it down even
more. It can almost only go one way, and that’s a strike.
An important prompt engineering technique was assigning different roles to
ChatGPT for the same question to get more than one perspective or answer on a
problem. One explained:
“I want you to respond like a wealth manager,” “I want you to respond like a
friend,” or “like a so-and-so...” And then you get different answers.
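As an illustration of the persona technique described above, a minimal sketch of how such a role prompt could be expressed against the OpenAI chat API is shown below. The informants used the ChatGPT interface rather than the API; the persona text, model name, and diff are illustrative assumptions:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

# Persona prompt: ask the model to behave as a highly critical code reviewer.
persona = (
    "You are the most critical code reviewer in the world. "
    "Review the following pull request diff and point out every weakness, "
    "from naming and style to correctness and security."
)

diff = """\
-def add(a, b): return a+b
+def add(a, b):
+    return a + b
"""  # hypothetical diff; in practice this would be the real pull request content

response = client.chat.completions.create(
    model="gpt-4",  # any chat-capable model; informants used ChatGPT
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": f"Please review this diff:\n{diff}"},
    ],
)
print(response.choices[0].message.content)
```

Swapping the system message for a different persona (“respond like a wealth manager”, “respond like a friend”) is enough to obtain the multiple perspectives the informants described.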
Using prompt engineering while writing code was described as feeling similar to programming with a partner. The flip side was that the developers mentioned doing less pair programming, as they were getting the desired rubber-ducking effect from using GenAI.
6 Discussion
One of the notable consequences of integrating GenAI tools into software devel-
opment tasks is a visible shift in collaborative communication dynamics. Some of
the informants appear to have a growing inclination to consult AI-driven solutions
for issues and tasks they previously discussed with their human colleagues. This
shift can have dramatic effects on the team’s ability to perform. According to Liu et
al. [19], a team’s ability to perform is highly dependent on the knowledge sharing
of the team. This implies that the effect of reducing knowledge sharing by replacing it with generative AI should be closely observed. This contrasts with the findings of Brynjolfsson et al. [7], who observed knowledge dissemination between high-skilled and low-skilled workers in customer service. These findings point to a significant difference between work
done by teams in software development and individual work in customer service.
However, we find that individuals become more efficient and save time, which they
spend on more rewarding tasks.
If team members reduce their interactions in favor of focusing on individual
tasks, a phenomenon known as an isomorphic team structure may emerge. The
advantages of this structure are that it is organizationally simple, allows many tasks to be completed in parallel, and makes task responsibilities clearly defined and understood. However, the effect of such a structure is that developers focus on their own modules and often create their own plans and make their own decisions. In addition, problems are seen as personal, individual goals become more important than team goals, and team members become less aware of what others are doing and get less support and help from others [23]. In a good working team, learning is a continuous feedback loop (see Fig. 1a). By introducing GenAI, this loop may be disrupted or reduced (Fig. 1b), thus reducing teamwork performance [23].
Additionally, this can also contribute to making persons less satisfied as helping
others is a key factor for good workdays [22].
This model posits that the incorporation of GenAI in software development
may disrupt the established learning loop. Such disruption will subsequently affect
individual and team performance in software development. While it is anticipated
that GenAI might enhance individual performance by streamlining tasks, there is a
concurrent risk of diminishing overall team performance.
However, everything is not dark; as multiple informants noted, they had an
increasing amount of knowledge sharing on how to use GenAI in their work and
context. Notably, the practice of “pair prompt engineering” has emerged, akin to
the concept of pair programming. This approach facilitates knowledge sharing [35]
both on-site and when working remotely. This can thus involve a shift in how the
programmers program, creating yet another abstraction layer for code production.
Fig. 1 Regular and disrupted learning loops. (a) Regular learning loop. (b) Disrupted learning loop
7 Concluding Remarks
In essence, GenAI serves a dual purpose: making everyday tasks more efficient and
reigniting creative thinking for leaders and developers. By automating the produc-
tion of routine code snippets and related tasks, these tools enable programmers
to focus on higher-level conceptualization and innovation, resulting in enhanced
productivity and code quality. This is similar to findings by Meyer et al. [22],
where good workdays are understood as days where they feel productive and are
able to work focused. In addition, similar to Brynjolfsson et al.’s findings where
there is knowledge dissemination through the GenAI [7], we observe that there are
data scientists using GenAI for coding purposes and front-end developers getting
assistance in back-end development. Both programmers and leaders acknowledged
the potential of generative AI in freeing up valuable time and cognitive resources
that could be better allocated to more creative and complex problem-solving tasks.
Acknowledgments This work was supported by the Research Council of Norway grants 321477
and 309344 and the companies Knowit and Iterate through the research projects Transformit and
10XTeams.
References
1. Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic
parrots: Can language models be too big? In: Proceedings of the 2021 ACM Conference on
Fairness, Accountability, and Transparency, pp. 610–623 (2021)
2. Berntzen, M., Stray, V., Moe, N.B., Hoda, R.: Responding to change over time: A longitudinal
case study on changes in coordination mechanisms in large-scale agile. Empir. Softw. Eng.
28(5), 114 (2023). https://doi.org/10.1007/s10664-023-10349-0
3. Biernacki, P., Waldorf, D.: Snowball sampling: problems and techniques of chain referral
sampling. Sociol. Methods Res. 10(2), 141–163 (1981). Publisher: Sage Publications Sage
CA: Los Angeles, CA
4. Bird, C., Ford, D., Zimmermann, T., Forsgren, N., Kalliamvakou, E., Lowdermilk, T., Gazit, I.:
Taking flight with copilot: early insights and opportunities of AI-powered pair-programming
tools. Queue 20(6), 35–57 (2022). https://doi.org/10.1145/3582083. https://dl.acm.org/doi/10.
1145/3582083
5. Bosch, J., Olsson, H.H.: Digital for real: a multicase study on the digital transformation of
companies in the embedded systems domain. J. Softw. Evol. Process. 33(5) (2021). https://doi.
org/10.1002/smr.2333
6. Bruch, M., Monperrus, M., Mezini, M.: Learning from examples to improve code completion
systems. In: Proceedings of the 7th Joint Meeting of the European Software Engineering
Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering,
pp. 213–222 (2009)
7. Brynjolfsson, E., Li, D., Raymond, L.R.: Generative AI at work. Tech. rep., National Bureau
of Economic Research (2023)
8. Chatterjee, P., Kong, M., Pollock, L.: Finding help with programming errors: An exploratory
study of novice software engineers’ focus in stack overflow posts. J. Syst. Softw. 159, 110454
(2020). Publisher: Elsevier
9. Corbin, J., Strauss, A.: Basics of Qualitative Research: Techniques and Procedures for
Developing Grounded Theory. Sage Publications, Thousand Oaks (2014)
10. Dell’Acqua, F., McFowland, E., Mollick, E.R., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S.,
Krayer, L., Candelon, F., Lakhani, K.R.: Navigating the Jagged Technological Frontier: Field
Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality
(2023). https://doi.org/10.2139/ssrn.4573321. https://papers.ssrn.com/abstract=4573321
11. Dickinson, T.L., McIntyre, R.M.: A conceptual framework for teamwork measurement. In:
Team Performance Assessment and Measurement, pp. 31–56. Psychology Press, London
(1997)
12. Dingsøyr, T., Bjørnson, F.O., Shull, F.: What do we know about knowledge management? Prac-
tical implications for software engineering. IEEE Softw. 26(3), 100–103 (2009); Conference
Name: IEEE Software. https://doi.org/10.1109/MS.2009.82
13. Fitzgerald, B., Stol, K.J.: Continuous software engineering: a roadmap and agenda. J. Syst.
Softw. (2015). http://doi.org/10.1016/j.jss.2015.06.063
14. Florea, R., Stray, V.: A global view on the hard skills and testing tools in software testing. In:
2019 ACM/IEEE 14th International Conference on Global Software Engineering (ICGSE), pp.
143–151. IEEE (2019). https://doi.org/10.1109/ICGSE.2019.00035
15. Forsgren, N., Storey, M.A., Maddila, C., Zimmermann, T., Houck, B., Butler, J.: The SPACE
of developer productivity: there’s more to it than you think. Queue 19(1), 20–48 (2021)
16. Hindle, A., Barr, E.T., Gabel, M., Su, Z., Devanbu, P.: On the naturalness of software. Commun.
ACM 59(5), 122–131 (2016); Publisher: ACM New York, NY, USA
17. Kersten, M., Murphy, G.C.: Using task context to improve programmer productivity. In:
Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software
Engineering, pp. 1–11 (2006)
18. Li, A., Endres, M., Weimer, W.: Debugging with stack overflow: Web search behavior in novice
and expert programmers. In: Proceedings of the ACM/IEEE 44th International Conference
on Software Engineering: Software Engineering Education and Training, pp. 69–81. ICSE-
SEET ’22. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/
3510456.3514147. https://dl.acm.org/doi/10.1145/3510456.3514147
19. Liu, M.L., Hsieh, M.W., Hsiao, C., Lin, C.P., Yang, C.: Modeling knowledge sharing and
team performance in technology industry: the main and moderating effects of happiness. Rev.
Manag. Sci. 14(3), 587–610 (2020). https://doi.org/10.1007/s11846-018-0301-4
20. Melegati, J., Guerra, E.: Dante: A taxonomy for the automation degree of software engineering
tasks (2023). arXiv
21. Mens, T., Cataldo, M., Damian, D.: The social developer: the future of software development
[Guest Editors’ Introduction]. IEEE Softw. 36(1), 11–14 (2019). Publisher: IEEE
22. Meyer, A.N., Barr, E.T., Bird, C., Zimmermann, T.: Today was a good day: the daily life of
software developers. IEEE Trans. Softw. Eng. 47(5), 863–880 (2019). Publisher: IEEE
23. Moe, N.B., Dingsøyr, T., Dybå, T.: A teamwork model for understanding an agile team: a case
study of a Scrum project. Inf. Softw. Technol. 52(5), 480–491 (2010). Publisher: Elsevier
24. Moe, N.B., Stray, V., Smite, D., Mikalsen, M.: Attractive workplaces: what are engineers
looking for? IEEE Softw. pp. 1–8 (2023)
25. Muller, M., Ross, S., Houde, S., Agarwal, M., Martinez, F., Richards, J., Talamadupula,
K., Weisz, J.D.: Drinking chai with your (AI) programming partner: A design fiction about
generative AI for software engineering. In: HAI-GEN Workshop at IUI 2022: 3rd Workshop on
Human-AI Co-Creation with Generative Models (2022). https://hai-gen.github.io/2022/papers/
paper-HAIGEN-MullerMichael.pdf
26. Murphy, G., Kersten, M., Findlater, L.: How are Java software developers using the Eclipse
IDE? IEEE Softw. 23(4), 76–83 (2006). Conference Name: IEEE Software. https://doi.org/10.
1109/MS.2006.105
27. Ravn, J.E., Moe, N.B., Stray, V., Seim, E.A.: Team autonomy and digital transformation:
disruptions and adjustments in a well-established organizational principle. AI & SOCIETY
37(2), 701–710 (2022)
28. Ross, S.I., Martinez, F., Houde, S., Muller, M., Weisz, J.D.: The programmer’s assistant: Con-
versational interaction with a large language model for software development. In: Proceedings
of the 28th International Conference on Intelligent User Interfaces, pp. 491–514 (2023). https://
doi.org/10.1145/3581641.3584037
29. Saldaña, J.: The Coding Manual for Qualitative Researchers, 2nd edn. SAGE, Los Angeles (2013)
30. Stray, V., Gundelsby, J.H., Ulfsnes, R., Brede Moe, N.: How agile teams make objectives and
key results (OKRs) work. In: Proceedings of the International Conference on Software and
System Processes and International Conference on Global Software Engineering, pp. 104–109
(2022)
31. Sun, J., Liao, Q.V., Muller, M., Agarwal, M., Houde, S., Talamadupula, K., Weisz, J.D.:
Investigating explainability of generative AI for code through scenario-based design. In: 27th
International Conference on Intelligent User Interfaces, pp. 212–228 (2022)
32. Svyatkovskiy, A., Zhao, Y., Fu, S., Sundaresan, N.: Pythia: AI-assisted code completion
system. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, pp. 2727–2735 (2019)
33. Talamadupula, K.: Applied AI matters: AI4Code: applying artificial intelligence to source
code. AI Matt. 7(1), 18–20 (2021). https://doi.org/10.1145/3465074.3465080. https://dl.acm.
org/doi/10.1145/3465074.3465080
34. Tkalich, A., Ulfsnes, R., Moe, N.B.: Toward an agile product management: What do product
managers do in agile companies? In: International Conference on Agile Software Development,
pp. 168–184. Springer (2022)
35. Tkalich, A., Moe, N.B., Andersen, N.H., Stray, V., Barbala, A.M.: Pair programming practiced
in hybrid work. In: Proceedings of the 17th ACM/IEEE International Symposium on Empirical
Software Engineering and Measurement (ESEM). Association for Computing Machinery, New
York (2023)
36. Ulfsnes, R., Mikalsen, M., Sporsem, T., Hatling, M.: Technology for knowledge work: A
relational perspective. In: ECIS 2023 Research-in-Progress Papers (2023). https://aisel.aisnet.
org/ecis2023_rip/48
37. White, J., Hays, S., Fu, Q., Spencer-Smith, J., Schmidt, D.C.: ChatGPT prompt patterns for
improving code quality, refactoring, requirements elicitation, and software design (2023).
arXiv. https://doi.org/10.48550/arxiv.2303.07839
38. Yin, R.K.: Case Study Research and Applications: Design and Methods. Sage, Los Angeles
(2018)
How Can Generative AI Enhance
Software Management? Is It Better Done
than Perfect?
Abstract Software development teams often deviate from their adopted framework, such as Scrum, and these deviations can bring consequences with different impact levels if the adaptations are not tailored to the specific teams’ needs and circumstances. For instance, Agile developers sometimes oversimplify crucial Agile steps, such as estimating the effort needed for a specific task, or lack an explicit assessment of the criteria for “Definition of Done.” This information, though, is useful for subsequent planning activities. We hypothesise that generative
AI could be used to help Agile teams conduct a number of software management
tasks in a systematic and effective way. A family of experiments to compare the
performance of humans and generative AI tools, namely, GPT-models and Bard, will
be conducted. The findings from these experiments will serve as a foundation for a
discussion on the role of artificial intelligence in software engineering tasks. This
discussion will primarily focus on the balance between performance (perfect?) and
efficiency (done?) and the importance of human oversight in Agile environments.
B. Cabrero-Daniel (✉)
University of Gothenburg, Gothenburg, Sweden
e-mail: beatriz.cabrero-daniel@gu.se
Y. Fazelidehkordi
University of Gothenburg, Gothenburg, Sweden
e-mail: gusfazya@student.gu.se
A. Nouri
Volvo Cars & Chalmers University of Technology, Gothenburg, Sweden
e-mail: ali.nouri@volvocars.com
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Nguyen-Duc et al. (eds.), Generative AI for Effective Software Development,
https://doi.org/10.1007/978-3-031-55642-5_11
1 Introduction
2 Background
Software development methodologies are defined as a set of rules and protocols that
are applied in different stages of developing and producing a software product [8].
Agile software development is one of the most popular methodologies used by software organizations in the past few decades. Among the different Agile process methodologies, Scrum is the most popular one, used by many companies. Scrum is defined as a practical lightweight method to control the process of developing and delivering a software product in a more flexible way [24]. The main advantage of the Scrum framework is that it gives teams the flexibility to adapt to changes that occur during the development of a piece of software, whether due to changes in the project budget, the initial requirements, or the structure of the development team.
Many software companies claim Scrum as the main framework in their workflow. However, according to Mortada et al., for several reasons, not all development teams are strictly faithful to the Scrum guidelines, which results in them applying the Scrum practices differently from what is suggested, or not using all of the recommended practices [20]. Mortada et al. discuss several problematic practices in Scrum teams, including not estimating the time and effort for each user story, not having the correct structure for writing the user stories in the backlog, not having a product backlog, not defining the sprint goal at the beginning of the sprint, and not ending the sprint by demonstrating the desired deliverable [20]. Another challenge for Scrum teams is that developers often do not receive feedback from the right colleagues or stakeholders at the end of the sprint because the results are not demonstrated to the customers [20]. Moreover, risk assessments and the prioritization of the project are usually done by business people, not developers; however, programmers need the freedom to schedule the riskiest segments of development first, to reduce the overall risk of the project [3]. The question then becomes how to estimate all these risks. Furthermore, each developer knows how many task points they managed to implement in the last iteration, and as a result, no one signs up for more points than they have in their personal task-point budget [13, 19]. Should AI take over such estimations?
2.2 Generative AI
What Is GenAI?
Generative AI can be used to solve tasks in many different contexts. Such tasks include mathematical problems, code generation, answering knowledge-intensive questions such as those in official tests (e.g., the US Medical License exam) [17] or sensitive questions, automating tasks, etc. [6, 26]. In this work, we focus on
the application of GenAI in work processes such as scrum practices and team
management. More precisely, we will look at these tools:
– GPT-3.5: large language model introduced by OpenAI, which can generate code
or natural language text [15].
– GPT-4: previous studies suggest that GPT-4 “is more reliable in performance
and safety, however, they are not fine-tuned for specific tasks” [15]. It is a larger
model than GPT-3.5, so it takes more time to generate responses [15].
– Bard: previous studies show it solves fewer problems in benchmarks compared to GPT models [12]; some studies connect these results to token limits and conversation retention capabilities [1].
While generative AI tools are widely used by individuals for personal purposes, companies and organizations can also leverage these tools to improve their working processes at different stages (e.g., managerial work, interactions with customers) [14]. Managers could use generative AI tools to get recommendations for decision-making, data transformations, and automating the interaction processes with customers [14]. An organization’s own data could be used as input to the AI tools for such purposes.
While LLMs such as GPT and Bard can perform impressive tasks, there are concerns regarding their robustness, transparency, and accuracy [9]. These concerns may create the need for a human to monitor how the models perform their tasks and to ensure that ethical decisions are made [5]. The question then arises of whether it is worth integrating GenAI tools. While training new workers for specific tasks is costly and time-consuming, using AI tools that can perform the same tasks in significantly less time and at lower cost seems a reasonable alternative for companies. However, there are concerns regarding the productivity and accuracy of AI assistant tools compared to humans.
Foundation models, including LLMs, and especially those implementing online learning strategies, are very sensitive to incorrect, redundant, or unstable data
distributions, which can lead to undesired replies or behaviors [5]. Moreover, there
are some concerns regarding biases when these models are trained or fine-tuned for
very specific purposes, which might make them unable to be creative even when
facing new contexts [18]. Another aspect that is concerning both for individuals
and companies is privacy of the data they have used to train the model, since these
models could be capable of recovering personal or sensitive data that could be used
for malicious purposes [18]. Therefore, it is important to be careful when inputting
sensitive data to these models. One common work-around for companies is using a
local LLM, trained to do specific tasks.
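As a sketch of what such a local setup might look like, the snippet below points an OpenAI-compatible client at a locally hosted model endpoint. The endpoint URL and model name are placeholders; any self-hosted inference server exposing an OpenAI-compatible API could be substituted:

```python
from openai import OpenAI  # the OpenAI SDK can also talk to OpenAI-compatible local servers

# Hypothetical local endpoint (e.g., a self-hosted inference server on the company network).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")

reply = client.chat.completions.create(
    model="local-llm",  # placeholder name of the locally hosted, task-specific model
    messages=[
        {"role": "system", "content": "You answer questions about our internal Scrum process."},
        {"role": "user", "content": "Summarize the Definition of Done for the payments team."},
    ],
)
print(reply.choices[0].message.content)
```

Keeping the model and the data inside the company network is precisely what addresses the privacy concerns discussed above.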
3 Research Method
4 Results
The goal of this experiment is to evaluate two different LLM-based chatbots, ChatGPT and Bard, in their ability to estimate the time needed to implement a number of unit tests. In this experiment, a human participant, agnostic to the task, and an LLM were given a C# class and some unit tests and were asked, “How many unit tests are missing?” To that question, the human participant replied that 6 unit tests were missing for good coverage. Interestingly, both tools gave very similar answers: “Approximately 5 missing unit tests.” While ChatGPT provided a few
examples of tests to implement (e.g., “computing the average salary when the team
has 0 members”), Bard listed the class method that should be tested (Table 1).
Then, all subjects, human or chatbot, were asked for an estimation of the time
needed by the participant to implement the 6 unit tests identified by them. The
participant estimated it would take them around 1 hour to implement the tests,
Bard estimated 2 or 3 hours, and ChatGPT replied that a junior developer like
them would spend between 15 and 30 minutes per test. Then, the participant was
asked to implement the missing unit tests, which were checked by a third party for
correctness. The participant reported spending 38 minutes, with each unit test taking
between 4 and 6 minutes to implement. Overall, all the subjects overestimated the
time needed to implement the tests.
In a second phase of the experiment, new methods, previously unseen by all the
subjects, were added to the classes before repeating the aforementioned process.
In the second iteration, the human participant reported 5 missing unit tests, while
ChatGPT reported 4 test cases, and Bard directly provided the code for 3. All
subjects were then asked to estimate the time needed to implement the unit tests,
but this time, the information about the work in the first round was given. While
the participant estimated it would take them 15 minutes, both ChatGPT and Bard computed the average time per test from the first round (6.33 minutes) and multiplied it by 5 to get an estimation. It is interesting to note that ChatGPT used decimals while Bard rounded up to 7 minutes per test, so ChatGPT’s estimate was slightly lower (roughly 31 vs 35 minutes).
Both chatbots also provided tips and suggestions for the human developer.
The participant finally took approximately 20 minutes to implement the remain-
ing 5 unit tests. This experiment therefore shows that even though LLMs might
overestimate the time needed to perform a task, they can provide reasonable
suggestions if provided with contextual information. An observation from this
experiment is that when asked to implement the 10 identified unit tests, ChatGPT
took 21.81 seconds, which raises the question of whether it is worth it to have
humans writing unit tests or rather revising the AI generated ones [9].
According to Gren et al., there is a bias by which humans overestimate the time
needed to implement a system, given a requirements specification with obsolete
or deprecated requirements. This can be seen on the left of Fig. 1, representing the estimations of bachelor students. We replicate this experiment here using OpenAI’s gpt-3.5-turbo model and the same prompts given to humans, which can be
found in [11]. The aggregated results, on the right side of Fig. 1, show that the
aforementioned bias cannot be observed in the estimations by GPT-3.5 (Table 2).
However, it is important to note that using different prompts to ask GPT-3.5 to
complete Task C (5 requirements, 1 obsolete) leads to significantly different results.
The bottom-right plot in Fig. 1 shows the differences between multiple runs of GPT
estimations using two different prompts:
1. Your task is to estimate how long (in terms of weeks) it will take to implement
the following requirements. {{list of requirements}} Please note that
R5 should NOT be implemented [10].
2. Your task is to estimate how long (in terms of weeks) it will take to implement
the first four requirements (R1–R4) in the following requirement specification:
{{list of requirements}}
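A minimal sketch of how such a replication could be scripted is shown below. The model name matches the one reported above, but the requirement texts, the number of runs, the grouping into prompt variants, and the ANOVA inputs are placeholders rather than the study’s actual materials:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK and an API key are available
from scipy.stats import f_oneway

client = OpenAI()

REQUIREMENTS = "R1 ... R5 ..."  # placeholder for the {{list of requirements}} used in the study

PROMPTS = {
    "flag_after_list": (
        "Your task is to estimate how long (in terms of weeks) it will take to implement "
        f"the following requirements. {REQUIREMENTS} Please note that R5 should NOT be implemented."
    ),
    "first_four_only": (
        "Your task is to estimate how long (in terms of weeks) it will take to implement "
        f"the first four requirements (R1-R4) in the following requirement specification: {REQUIREMENTS}"
    ),
}


def ask_estimate(prompt: str) -> str:
    """Send one prompt and return the raw textual estimate."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Collect several runs per prompt variant; parsing the number of weeks out of the
# free-text reply is omitted here and would need its own post-processing step.
runs_per_variant = 10  # illustrative, not the number of runs used in the chapter
raw_replies = {name: [ask_estimate(p) for _ in range(runs_per_variant)] for name, p in PROMPTS.items()}
print({name: len(replies) for name, replies in raw_replies.items()})

# Once numeric week estimates have been extracted, a one-way ANOVA can compare the groups:
weeks_variant_a = [4.0, 5.0, 4.5]  # dummy values standing in for parsed estimates
weeks_variant_b = [8.0, 7.5, 9.0]
print(f_oneway(weeks_variant_a, weeks_variant_b))
```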
Other alternative prompts, not discussed in this chapter as they would be outside our scope, were also tested. As a general comment, it is important to note that GPT performed significantly better if the requirement was flagged as obsolete at the beginning of the prompt, not after the {{list of requirements}}. A one-way ANOVA test reveals that the means of the groups are statistically significantly different.
Fig. 1 Humans (left, from [11]) and GPT’s estimation (right) of time for specifications with
different numbers of requirements and obsolete requirements. Task C was performed with two
prompts and obtained significantly different results
The Definition of Done (DoD) is a formal description of the state of the Increment
when it meets the quality measures required for the product [23]. Only when a
Table 4 Percentage of positive replies for each alternative question. Each prompt was used three
times, and done was decided using the “best of three” method discussed in [22]. Each prompt was
followed by “Answer only either Yes or No”
Question | GPT says done
Considering these criteria, is the current status of the task “complete?” | 98.83%
Can this work item be presented at the Sprint Review? | 99.80%
Can this Product Backlog item be released? | 99.76%
Can this work item be considered an Increment already? | 82.55%
Is it done? | 69.70%
Are we done? | 43.84%
Product Backlog item meets the DoD is an Increment born [23]. If a Product
Backlog item does not meet the DoD, it cannot be released or even presented at
the Sprint Review and should return to the Product Backlog for future consideration
instead [23] (Table 3).
Unless the DoD is an organizational standard, each Scrum Team must create
a DoD appropriate for the product under development [23]. However, in this
experiment, we will assume a generic DoD addressing the 11 important quality
properties of software: (i) code is written, (ii) unit tests are written and passing,
(iii) integration tests are written and passing, (iv) code was peer reviewed, (v)
documentation is updated, (vi) code was refactored (has been improved and
optimized and is maintainable), (vii) system testing is done, (viii) security review is
done, (ix) acceptance tests are done, (x) feature was delivered, and (xi) feature was
installed.
We ran a study using OpenAI’s API for the GPT-3.5 model, asking whether an item was done given different combinations of the aforementioned 11 generic criteria marked as completed. All combinations of the criteria with any number of elements, where order does not matter and without repetition (a total of 2046), were used. Finally, all the combinations were tested for each of the alternative questions about the state of the increment in Table 4.
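A sketch of how this enumeration and querying could be set up is shown below, assuming the OpenAI Python SDK. The criteria follow the generic DoD above, the question text is one of the variants in Table 4, and combinations of 1 to 10 criteria yield the 2046 subsets mentioned; the three repetitions and the criterion-order randomization reported below are omitted for brevity:

```python
from itertools import combinations

from openai import OpenAI  # assumes the OpenAI Python SDK and an API key are available

client = OpenAI()

CRITERIA = [
    "code is written",
    "unit tests are written and passing",
    "integration tests are written and passing",
    "code was peer reviewed",
    "documentation is updated",
    "code was refactored",
    "system testing is done",
    "security review is done",
    "acceptance tests are done",
    "feature was delivered",
    "feature was installed",
]

QUESTION = 'Considering these criteria, is the current status of the task "complete"?'


def gpt_says_done(completed: tuple[str, ...]) -> bool:
    """Ask GPT-3.5 whether the item is done, given the completed criteria."""
    prompt = (
        f"The following Definition of Done criteria are completed: {', '.join(completed)}. "
        f"{QUESTION} Answer only either Yes or No."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")


# Subsets of size 1..10 of the 11 criteria: 2^11 - 2 = 2046 combinations.
subsets = [c for k in range(1, len(CRITERIA)) for c in combinations(CRITERIA, k)]
assert len(subsets) == 2046

results = {subset: gpt_says_done(subset) for subset in subsets}
positive_rate = sum(results.values()) / len(results)
print(f"GPT says done for {positive_rate:.2%} of the combinations")
```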
We do not have the ground truth for this question or, rather, the human evaluation
for which out of the 2046 combinations of criteria correspond to a “done” item for
all the questions in Table 4. Strictly speaking, every combination except the one where all criteria are met should define an item that is not “done.” However, as GPT points out, criteria
are interrelated (e.g., passing unit tests assumes the code is written), so the human
decision-making process is often more complex than a checklist. The goal of this experiment, though, is only to provide large-scale evidence of how different the
results can be because of the formulation of the prompt, as reported in the second
column of Table 4.
These differences might be explained by the directness of the question formulation and the LLMs’ interpretation of the context. For example, the question “Are we done?” is very open-ended, even though the Scrum Guides with the rules for DoD were given as context [23]. GPT might interpret this question as asking whether the work day is finished or whether the item meets the DoD as desired.
The tests were conducted with and without randomizing the order of the criteria and repeated 3 times to reduce the impact of hallucinations on the results. As a result, a total of 6141 queries were sent through the API (all responses are in the Supplementary Materials). Out of all possible combinations of the DoD criteria and prompt question, 87.3% consistently received the same reply (positive or negative). Based on these results, we can state that changing the prompt changes GPT’s evaluations even when it is given the same contextual information (i.e., completed criteria), whereas changing the order of the criteria does not. As a final step for this experiment, we computed the frequency with which each criterion appeared in the prompts. Surprisingly, we could not observe any predominant exclusion criterion (condition sine qua non) for the DoD according to GPT-3.5. Figure 2b (right) shows that the lack of some tests (unit, integration, and acceptance) and not having released the feature make GPT decide to reply with a “no.”
Fig. 2 Results of experiment 3. (a) Positive replies with respect to the number of fulfilled criteria. (b) Criteria in prompts for work items flagged as not done
Table 7 Percentage of times that GPT checks all the elements in the checklist, depending on the percentage of completed items

Completed items (ground truth) | Prompt instances (num. of tests) | GPT says yes / prompt instances
16.66% | 6 | 0%
33.33% | 30 | 0%
50% | 60 | 0%
66.66% | 60 | 14.28%
83.33% | 30 | 61.53%
100% (6 out of 6) | 6 | 100%
This experiment was repeated using OpenAI’s gpt-4-0613 model and with
Google’s Bard to study whether these assumptions are common across different
LLMs. In both cases, the results in Table 7 are improved, reaching 100% accuracy,
and none of the items in the checklist are checked if not explicitly addressed
by one of the tasks in the provided list. The intuition behind it is that once the
mapping between the checklist items and the completed task is done, a task that
both GPT and Bard excel at, Bard resembles a rule-based approach, while GPT-3.5
makes assumptions and attempts to fill the gaps in the completed tasks. Section 5
discusses the trade-off between these approaches and highlights the need for a better understanding of how human developers would interact with generative AI tools for the purposes described here.
In this experiment, the LLMs were given a set of requirements for a system followed
by a natural language sentence, introduced as written by a developer, which states
which functionalities and quality properties are already implemented in the system.
Then, the LLM was asked whether the system met the requirements based on that
sentence or not. A total of 52 human-written sentences were tested in the scope of
this experiment. This application aligns with the strengths of LLMs in understanding
and generating text. All the replies were gathered and stored for later analysis. What
follows is a discussion of the strengths and shortcomings of GPT and Bard in this
task (Table 8).
First, the chatbots were given a complete list of ten safety requirements for a
system. In order to evaluate whether the LLMs prioritized important requirements,
the safety requirements were written using “shall” or “should” to denote mandatory
and recommended requirements, respectively. For instance, while “the system
shall avoid collisions with a pedestrian,” “the system should minimize its power
consumption.” The cases were divided into two groups, as reported in Table 9. On the one hand, there were cases where 100% of the mandatory requirements were met, i.e., the sentence stated that all of the “shall” requirements were met but some “should” requirements might be missing; on the other hand, there were cases where only some of the “shall” requirements were met. In the first case, GPT-3.5 replied that the requirements were met 75% of the time, and these instances correspond to true
positives. However, when the “shall” requirements were partially met according to
the statement, GPT-3.5 reported them being met in almost a third of the instances
(false positives). Taken together, these results show that GPT-3.5 has an accuracy of
69.23% in this binary classification task.
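A minimal sketch of how one such binary query could be issued is shown below. The requirement texts, the developer sentence, and the use of the OpenAI chat API are illustrative assumptions about the setup, not the study’s actual materials:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK and an API key are available

client = OpenAI()

# Hypothetical mix of mandatory ("shall") and recommended ("should") requirements.
requirements = [
    "R1: The system shall avoid collisions with a pedestrian.",
    "R2: The system shall have redundant processors.",
    "R2.1: At least one processor shall be functional at all times.",
    "R3: The system should minimize its power consumption.",
]

# Hypothetical developer statement describing what is implemented.
developer_sentence = "The system has a functional processor at all times and brakes for pedestrians."

prompt = (
    "Here is a list of safety requirements:\n"
    + "\n".join(requirements)
    + f'\n\nA developer wrote: "{developer_sentence}"\n'
    "Based only on that sentence, does the system meet the requirements? "
    "Answer only either Yes or No, then explain briefly."
)

reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
answer = reply.choices[0].message.content
is_met = answer.strip().lower().startswith("yes")
print(is_met, "-", answer)
```

Comparing the parsed Yes/No against a human label for each of the 52 sentences is what yields the accuracy figure reported above.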
Then, the explanations for the decisions were analyzed separately. In the case of
false positives, that is, implementation descriptions that did not explicitly address
all the requirements but that were classified as doing so, GPT often justified its
decision using the hierarchy of the requirements. For instance, when the sentence
stated that “the system has a functional processor at all times,” GPT judged that two
requirements were met: (i) “at least one processor shall be functional at all times,”
and (ii) its parent requirement, “the system shall have redundant processors,” even
though they were not connected explicitly in the prompt. Highlighting dependencies in the requirements is an interesting emerging result that might be helpful to developers performing this task. However, these assumptions might introduce errors
in the logic process.
Another surprising behavior was detected during the analysis of the replies by the
GPT-3.5 and GPT-4 models. In the replies, GPT models often said that the system
was not safe even though all of the safety requirements were met because one or
more of the requirements conflicted with the notion of safety, which was not defined
in the prompt. For example, “Driving fast can increase the risk of collisions and
compromise safety.” These results prove that the GPT models are not acting as a
rule-based system. Instead, the models understand the context, create a hierarchy
and mapping of the requirements, detect conflicts, and are able to come up with
sensible and understandable warnings for the user. Interestingly, Bard did not warn
the user about the dangers of speeding, and often stated that speed “is not
relevant to safety” because “speed is not a safety factor in and of itself and a system
can be safe at any speed.”
5 Discussion
5.1 Oversight
This result addresses RQ1, which questions how GenAI tools perform in
software management tasks compared to humans. Experiments 1 and 2 show that
different LLMs can estimate time to complete a task, potentially being useful
in capacity planning meetings. This estimate could be used as a reference point,
complementing human judgment rather than replacing it entirely.
In the scope of both the EU AI Act [9] and the Scrum Guides [23], it is important
to remind the reader that such a GenAI-powered system would need to be used as a
suggestion tool for time estimations and appropriate measures should be put in place
to avoid unnecessary stress on the employee or foster unfair evaluations the quality
of engineer. The tool could be employed at individual, team, or managerial level
to ensure that estimations are sensible and complement human judgment without
undermining their agency.
However, engineers might feel discomfort when being judged by an AI, particularly
in a domain that involves intellectual activities. The introduction of GenAI could
potentially create pressure on developers, as its integration might be accompanied
by expectations of increased efficiency and accuracy. The introduction of GenAI
in software management has stirred debates about its suitability, user base, ethical
implications, and the potential impacts on developers. The ethical responsibility lies
in striking a balance where GenAI aids developers without placing undue stress on
them, ensuring a supportive environment that values both human expertise and AI
assistance.
One significant observation highlighted in Experiment 2 is that while humans
tend to overestimate the time required for project implementation when dealing with
obsolete requirements, GPT appears to be free from this bias, indicating its potential
utility as an estimation reviewer, addressing RQ2 on whether there are any tasks
where GenAI surpasses humans.
While LLMs exhibit impressive prowess in various tasks, the scope of their effectiveness is not without constraints. The results of Experiment 3, on the other hand, highlight the complexities involved in using a language model such as GPT for determining task completion. The more criteria a task fulfils, the more likely GPT is to consider it “done.” Once again, this issue might be due to prompts that are not adapted to the task. Also, the diversity of the training data can impact the model’s capacity to generate contextually appropriate responses.
This contrasts with the results in Experiments 4 and 5, where hierarchies arose between the items in the prompt and GPT used more complex reasoning. The
observed nuances in the reasoning of GenAI tools, reported in Sect. 4.3, provide
insights to address RQ2.
It is worth noting that different models showed not only different performances but also different reasoning styles. For instance, GPT-4 and Bard appear
to diverge from GPT-3.5’s approach in Experiments 4 and 5, possibly indicating a
progression toward refining AI logic to be more cautious with the assumptions made
during the reasoning, in line with the findings in [6]. The discussion then naturally
turns to the question of whether simplicity is better or worse. Simpler logic may
lead to more conservative assumptions, potentially reducing the risk of incorrect
judgments based on indirect associations. On the other hand, more complex logic,
as seen in GPT-3.5, might lead to misinterpretations of the context and dangerous
situations if developers show automation bias.
6 Conclusion
Even though software management strategies using GenAI might not always surpass
human performance, they are reasonably able to conduct tasks that are often
overlooked by Scrum teams, according to previous studies [20]. However, as
presented in Sect. 4, their performance is often sub-par and very sensitive to prompt engineering techniques. Taken together, the results raise the question of whether it is better to have a GenAI tool assist in these tasks even though the results might not be as good as desired. In order to answer this question, there should be careful
consideration of the trade-off between traditional approaches and potentially faster,
error-prone work.
All in all, integrating GenAI in software management offers distinct advantages
related to efficiency, reduced human error, and enhanced productivity. However,
the human factor remains crucial, particularly in complex problem-solving, creative
tasks, and areas that demand nuanced human judgement. However, the results also
highlight the difference between the baseline GenAI models and when software
management context is provided. The main takeaway message of the experiments
is that providing context, either in single prompts or in conversations, increases
the chance of achieving sensible results. This further motivates the usage of chain-of-thought strategies to guide and correct LLM reasoning [25]. For these reasons, ethical concerns should be addressed to avoid potential negative consequences that could emerge from an uninformed or hasty GenAI integration [9].
7 Future Work
GenAI tools’ ability to understand and generate text is powerful, but their comprehension lacks the nuanced understanding that would be required to make complex decisions. What follows is a list of examples of tasks that are difficult to automate using AI but for which humans would benefit from AI assistance:
– Understanding user needs and then forging a team; we then need to estimate the number of person- or team-weeks, lines of code, function points, or general effort to set expectations of what can be accomplished in the given time frame.
– Creating a preliminary backlog (44% of developers work without a backlog [20]).
– Breaking down large user stories during, or prior to, the meeting (56% of developers do not [20]).
– Maintaining a growth chart in story points [7].
– Suggesting lightening the team’s methodology, or warning when it is not sufficient [7].
– Defining the sprint goal at the start of the planning meeting, which is something that 82% of developers do not do (neither at the start nor ever) [20].
– Managing project portfolios and finding stopping points [7].
As AI technologies continue to advance, there is potential for more sophisticated
solutions that combine AI-generated content with domain expertise, guiding us
toward a more reliable and sensible integration of GenAI within the software man-
agement pipeline. Such software management tasks need contextual understanding,
domain knowledge, and the ability to weigh different criteria against each other.
To conclude, we want to bring up again the question of whether using GenAI can increase productivity in software management even though there is no guarantee of human-level performance. We therefore need to balance diverse GenAI tools with human intelligence and collaborate with them.
Generative AI could be used to work alongside humans, providing them with prior
estimations when needed and warning them about potential bias (see Sect. 4.1). How
could developers effectively use multiple generative AI tools? Using multiple AI
tools can be approached in various ways, each offering its own set of advantages
and potential outcomes.
On the one hand, one approach is to employ multiple GenAI tools simultaneously, in parallel to the humans. This setup would allow for the generation of diverse outputs from different AI models, enabling developers to explore a wider range of creative solutions and responses. On the other hand, another strategy involves using the AI tools sequentially: using one first (preferably the better-performing one) and then refining the output with the other tool could lead to enhanced results. This could be done in an iterative loop to polish the output over multiple iterations.
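As a rough illustration of the sequential strategy, the sketch below drafts an output with one model and asks a second model to refine it. The two-model pairing, the OpenAI-only setup, the task, and the number of refinement passes are assumptions made for illustration; in practice the second tool could be a different vendor’s model:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK and an API key are available

client = OpenAI()


def ask(model: str, prompt: str) -> str:
    """Send a single prompt to the given chat model and return the reply text."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content


task = "Write a short sprint goal for a team building a payment integration."  # illustrative task

# Step 1: draft with the (assumed) better-performing model.
draft = ask("gpt-4", task)

# Step 2: refine the draft with a second model; repeat in a loop to keep polishing.
refined = draft
for _ in range(2):  # two refinement passes, chosen arbitrarily
    refined = ask(
        "gpt-3.5-turbo",
        f"Improve the following text, keeping it concise and concrete:\n{refined}",
    )

print(refined)
```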
Fine-tuning LLMs for particular tasks holds the promise of enhancing their perfor-
mance and tailoring their responses to the task’s domain. It effectively adjusts the model’s language generation skills to align with the nuances and
expectations of the designated task. However, fine-tuning may lead to overfitting,
where the model becomes excessively specialized and may struggle with generating
diverse and creative responses even within the domain.
Moreover, fine-tuning demands substantial amounts of task-specific data, which
might not always be readily available. For this reason, continuously training models
could be a good strategy. Continuously training the GenAI on real-world team-based
scenarios could enhance a possible GenAI coworker’s understanding over time.
The broader question raised is how AI systems strike a balance between making
assumptions and ensuring accuracy and how these assumptions and decisions could
be seamlessly integrated into different layers of the Agile development workflow.
GenAI could be trained like one more of your colleagues. This is the ultimate col-
laboration, no? This might hold the potential to revolutionize software management,
bridging the gap between automation and human insights.
References
1. Ahmed, I., Kajol, M., Hasan, U., Datta, P.P., Roy, A., Reza, M.R.: ChatGPT vs. Bard: A comparative study. UMBC Student Collection (2023)
2. Alberts, I.L., Mercolli, L., Pyka, T., Prenosil, G., Shi, K., Rominger, A., Afshar-Oromieh, A.:
Large language models (LLM) and ChatGPT: what will the impact on nuclear medicine be? Eur.
J. Nuclear Med. Molec. Imag. 50(6), 1549–1552 (2023)
3. Beck, K.: Extreme Programming Explained: Embrace Change. Addison-Wesley Professional,
Boston (2000)
4. Cabrero-Daniel, B., Fazelidehkordi, Y., Ratushniak, O.: Trustworthy “blackbox” Self-Adaptive
Systems (2023)
1 Introduction
A. Nguyen-Duc ()
Norwegian University of Science and Technology, Trondheim, Norway
e-mail: anhn@ntnu.no
D. Khanna
Free University of Bozen-Bolzano, Bolzano, Italy
e-mail: dron.khanna@unibz.it
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Nguyen-Duc et al. (eds.), Generative AI for Effective Software Development,
https://doi.org/10.1007/978-3-031-55642-5_12
Agile project management involves various practices and tools, including daily stand-up meetings,
sprint planning, sprint reviews, and retrospectives. Although it has gained popularity
recently, contemporary Agile software project management still faces many chal-
lenges [11]. These challenges encompass a range of factors, including scaling Agile
practices to larger and more complex projects, fostering effective cross-functional
teamwork, and ensuring that Agile principles are effectively integrated into orga-
nizational culture and structure. On the horizon of technological advancement,
we anticipate the emergence of visionary project management tools and practices
that will transform how projects are planned, executed, and controlled, potentially
addressing some of the existing challenges. One such innovation is the integration
of generative AI tools, such as ChatGPT (OpenAI) and Bard (Google),
into project management workflows. These tools can act as intelligent virtual
assistants, reducing the likelihood of human error, optimizing resource allocation,
and improving decision-making.
The potential adoption of GenAI tools in engineering activities has been
increasingly discussed in the Agile community [12, 23]. Our workshop, titled the
“AI-Assisted Agile Software Development Workshop,” part of the XP2023 confer-
ence, represents one of the pioneering events contributing to this ongoing discourse.
The workshop gave participants valuable insights and practical experiences with
ChatGPT in concrete Agile project development tasks. However, it’s important to
note that all the work conducted thus far remains in its early stages, consisting
primarily of conceptual ideas and preliminary results.
Moreover, the debate tends to focus on the technical capacity of AI tools
in performing tasks rather than on the managerial consideration of adopting AI
tools in Agile projects. This motivates us to conduct an empirical study on the
perception and intention of adopting ChatGPT among Agile project managers.
Understanding how Agile project managers adopt and use technology is essential
for implementing AI tools in the current Agile workflow. The results of this research
can serve as a valuable resource for software development teams, decision-makers,
and organizations considering the adoption of ChatGPT in Agile environments.
In the Nordic regions, the Agile methodology has gained substantial prominence
as a popular paradigm for software development across various contexts. It has
found application across diverse scenarios, spanning from small to large devel-
opment teams [4, 9] and various product categories, including Web application,
enterprise software and embedded systems [14, 20], and different organizational
context [5]. Furthermore, the availability of AI tools such as ChatGPT within
the Nordic regions underscores their readiness for technological innovation. These
twofold circumstances place the Nordic countries at the forefront of GenAI
adoption in a professional Agile project context.
1 https://openai.com/.
2 https://bard.google.com/.
2 Background
Hoda et al. envision AI-assisted Agile software management supporting tasks such as user story creation, task breakdown and refinement, and test case optimization [12].
Holzmann et al. conducted a Delphi study with 52 project managers to portray future AI applications in project management [13]. The authors presented a list of the project management areas where GenAI can most usefully be adopted, i.e., the creation of the project schedule, the WBS, and the task list. Prifti identified pain points in project management, including safety, privacy, autonomy, data availability, and employment [21], and then proposed how AI assistance can help project managers become more productive and skilled in their organizations.
3 Research Approach
The objective of this 2023 survey was to understand how AI can assist project management activities. The study includes two phases: (1)
identification of tasks with the possibility of adopting ChatGPT and (2) surveying
opinions of Agile professionals about adopting ChatGPT for such tasks. The
overview of the research process is shown in Fig. 2.
The study was initiated in February 2023. A literature review was conducted at that time; however, little relevant work was found on GenAI and Agile project management. In April 2023, we conducted an informal interview with an Agile project manager in Oslo, Norway. The interviewee had started using ChatGPT for professional work, including correcting emails and generating ideas, and he suggested the kinds of tasks to which ChatGPT could be applied. During the AI-assisted XP workshop (mentioned in Sect. 1), a focus group was conducted to refine the list of Agile project tasks that ChatGPT can assist with. Afterward, key focus group members held another online meeting to discuss strengths, weaknesses, opportunities, and threats regarding adopting ChatGPT for each task. To simplify the list, we classified the tasks into two types: (1) technical activities (including requirement work, software design, coding, testing, and deployment) and (2) managerial activities.
Based on the task list (Sect. 3.1), a questionnaire was designed in the first week of August 2023. The purpose of the questionnaire is to gather the opinions of Nordic Agile experts and evaluate their perceptions of ChatGPT adoption in the Agile context. The survey aims to describe a trend in a specific population. We adopted a structured cross-sectional survey process between July 2023 and August 2023.
The survey consists of two main parts. Part 1 asks about the background and experience of the participants. Part 2 is a list of multiple-choice questions using a 5-point Likert scale. For each task, two Likert questions were asked: (1) the level of effort saved and generated and (2) the level of complexity of setting it up. Each question offered five standard choices (1 = the least to 5 = the most). Each question also provided an opt-out option, so that participants who were not sufficiently confident in or interested in a question could skip it, and respondents could elaborate freely via open-text responses.
The sampling strategy for this study is convenience-based and localized. Responses were collected from our professional network (ca. 23% of the total number of answers), social media (ca. 15%), and professional recruitment platforms (ca. 62%).
The survey was available from July 15 to August 15. The total number of
responses is 115. After filtering out the invalid responses (empty responses,
responses with random answers, and responses from people without relevant
background or experience), the number of included responses for analysis is 73. The
demographic information of the survey respondents, i.e., their locations, knowledge of Agile projects, and experience with using ChatGPT or GenAI tools, is shown in Fig. 3. As indicated, a significant proportion of the respondents originate from Norway, which reflects the authors' professional network. Most respondents have between 1 and 10 years of experience working on or managing Agile projects. The majority of respondents have used GenAI tools, such as ChatGPT, in both their everyday tasks and their professional work.
Table 1 (continued)

Category: Managerial activities

- Preparing for the sprint planning meeting: Analyze documents, generate the meeting plan and agenda, keep track of meeting minutes
- Defining sprint goals: Synthesize input from stakeholders and historical data to assist in defining clear and achievable sprint goals
- Selecting product backlog items: Assist in selecting the most appropriate backlog items by considering factors like priority, dependencies, and business value
- Creating a sprint backlog: Automate the process of populating the sprint backlog with selected items, ensuring that it reflects the sprint goals and team capacity
- Facilitating the review meeting: Assist in organizing and structuring the review meeting by generating agendas, tracking progress against sprint goals, and prompting discussions on completed work
- Preparing for the retrospective meeting: Analyze historical data and team feedback to prepare meaningful topics for discussion in the retrospective meeting
- Onboarding new team members: Provide automated onboarding guides and documentation, helping new team members quickly integrate into existing workflows and processes
- Supporting engineers' retention and team health: Analyze team dynamics and recommend strategies to improve team health and retain engineers
- Documenting the review outcome: Help create detailed and standardized documentation of the review outcomes
- Writing or improving formal emails: Help compose well-structured and professionally written formal emails, ensuring clarity, conciseness, and appropriate tone, which can enhance communication effectiveness
- Meeting preparation with agenda, slides, note-taking, etc.: Assist in meeting preparation by generating agendas, creating presentation slides, and even automating note-taking, saving time and ensuring organized and productive meetings
- Stakeholder management with meeting reminders, communication strategy, etc.: Help manage stakeholders by sending meeting reminders, suggesting communication strategies, and providing timely updates, facilitating smooth and efficient interactions
- Assisting technical writing: Support technical writers by generating technical documentation, explanations, and guides, ensuring accuracy, consistency, and accessibility of technical content
[Fig. 3 Demographic information of the survey respondents: country (Denmark, Finland, Sweden, Norway), professional Agile experience, and experience with GenAI tools]
Likert-scale responses are analyzed at the interval measurement scale (mean value). We documented and labelled the open-ended answers. We assume that respondents' perceptions of the benefits and of the sacrifices of adopting GenAI tools are formed independently of each other. When comparing two separate sets of responses to a Likert question, we ran a two-tailed independent t-test to compare the respondents' perceptions of the benefits and sacrifices. Measuring user value as a trade-off between relative quality and relative price, or, in our case, between perceived benefits and perceived complexity, is a well-known approach. On the interval measurement scale, we calculated the user value as the mean perceived benefit minus the mean perceived complexity. Technical and managerial tasks are then sorted according to their user value.
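As a concrete illustration, the sketch below shows how the user value and the accompanying two-tailed independent t-test could be computed for a single task. This is a minimal sketch of our reading of the procedure described above, not the actual analysis script, and the example ratings are hypothetical.

```python
# Minimal sketch: user value and two-tailed independent t-test for one task.
# The ratings below are hypothetical illustrations, not survey data.
from scipy import stats

# 5-point Likert ratings for one task, e.g., "Code summarization and explanation"
benefit_ratings = [4, 4, 3, 5, 4, 3, 4, 5, 3, 4]      # perceived benefit (1-5)
complexity_ratings = [2, 3, 2, 3, 2, 3, 2, 2, 3, 3]   # perceived complexity (1-5)

mean_benefit = sum(benefit_ratings) / len(benefit_ratings)
mean_complexity = sum(complexity_ratings) / len(complexity_ratings)

# User value = mean perceived benefit - mean perceived complexity
user_value = mean_benefit - mean_complexity

# Two-tailed independent t-test: are benefit and complexity ratings
# drawn from distributions with different means?
t_stat, p_value = stats.ttest_ind(benefit_ratings, complexity_ratings)

print(f"mean benefit = {mean_benefit:.2f}, mean complexity = {mean_complexity:.2f}")
print(f"user value = {user_value:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```

Running the same procedure for each task and sorting by user value yields the rankings discussed in the next section.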
4 Results
The survey asked about seventeen Agile project technical tasks. We calculated the mean value of perceived benefit and perceived complexity for each task. The maximum perceived benefit was obtained for Code summarization and explanation (3.69), followed closely by Generating code from given libraries (3.66), whereas the minimum perceived benefit was for Requirement modelling and analysis (3.37). Among the seventeen tasks, we did not find a large spread in the mean perceived benefit (3.69 − 3.19 = 0.50). In terms of perceived complexity, however, we observed a larger spread (3.55 − 2.59 = 0.96), indicating that some Agile project technical tasks were perceived as considerably harder to set up than others.
The technical tasks with the highest user values are code summarization and explanation (1.1), generating code from libraries (0.87), user story correction and refinement (0.73), and user story generation (0.7). For these tasks, a t-test showed a statistically significant difference between the distributions of respondents' ratings of benefit and complexity. The technical tasks with negative user value (perceived complexity greater than perceived benefit) are designing the overall architecture against a given set of requirements (−0.07), designing the detailed architecture of components from the general ones (−0.17), and requirement modelling and analysis (−0.18).
Proposition 1 Overall perceived benefits exceed perceived complexity when it
comes to the adoption of GenAI in technical tasks.
We conducted t-tests between the benefit scores and the complexity scores to test whether they are significantly different from each other. Ten tasks resulted in p-values lower than 0.05, while seven yielded p-values higher than 0.05. These findings indicate that our results are unlikely to be attributable to random chance alone: for ten tasks, respondents rated the benefits and the complexity significantly differently.
Proposition 2 Perceived benefits of adopting GenAI differ among technical activities. The adoption of GenAI for code summarization and generation and for user story generation and refinement presents the best user value.
Regarding the standard deviation (SD), we obtained low values for most of the tasks, indicating strong agreement among the respondents. The reactions and ratings were reasonably uniform, with little disagreement in their perceptions. This also suggests that we would obtain similar responses if we ran the same survey questions with the same respondents under the same conditions.
While the user value has a relatively large range, from 1.1 to −0.18, the perceived benefits do not vary as much (between 3.75 and 3.34). This shows the significant impact of perceived complexity on realizing the benefits of GenAI among technical activities. In Table 2, tasks related to requirements and coding rank at the higher end, while tasks concerning processes and architecture are positioned at the lower end of the spectrum.
Proposition 3 The user value for adopting GenAI in architectural and process-
related tasks is notably lower when contrasted with coding and requirement tasks.
The survey asked about thirteen Agile project managerial tasks. We calculated the mean value of perceived benefit and perceived complexity for each task. The maximum perceived benefit was obtained for writing or improving formal emails and assisting technical writing (3.9), and the minimum for Supporting engineers' retention and team health (2.92). We observed that technically oriented tasks have higher perceived benefits than human-factor-related tasks. Among the thirteen tasks, we found a larger spread in the mean perceived benefit than in the previous section (3.9 − 2.92 = 0.98, close to a full point), and in terms of perceived complexity the spread was larger still (3.37 − 2.07 = 1.30); some Agile project managerial tasks were perceived as considerably harder to set up than others.
The managerial tasks with the highest user values are writing or improving formal emails (1.83), assisting technical writing (1.53), preparing meetings with agendas, slides, and note-taking (1.07), and documenting review outcomes (0.82). For these tasks, a t-test showed a statistically significant difference between the distributions of respondents' ratings of benefit and complexity.
Proposition 4 Overall perceived benefits exceed perceived complexity when it
comes to the adoption of GenAI in managerial tasks.
The managerial task with a negative user value (perceived complexity greater than perceived benefit) is Supporting engineers' retention and team health (−0.45). This was also the only task whose t-test yielded a p-value of 0.05, which does not demonstrate a significant effect. Moreover, only four tasks obtained p-values above 0.05, while eight were below 0.05, indicating that for those eight tasks the results are unlikely to have appeared by random chance alone; these results are well-grounded and give a higher degree of confidence in the observed effect. Regarding the standard deviation (SD), we obtained low values for almost all thirteen tasks, meaning the responses were tightly clustered around the mean and there was strong agreement among the participants.
While the user value has a relatively large range, from 1.83 to −0.45, the perceived benefits do not vary as much (between 3.9 and 2.92). This again shows the impact of perceived complexity on realizing the benefits of GenAI among managerial activities.
Proposition 5 User values vary significantly among managerial activities. The
adoption of GenAI for project communication and documentation presents the best
user value.
5 Discussion
Our findings reveal a positive attitude among our respondents when harnessing the
capabilities of ChatGPT. Our list of identified managerial and technical tasks (Table 1) covers all the tasks in the Agile copilot vision of Hoda et al. [12]. Compared to Holzmann's anticipated project management scenarios [13], there is an alignment on 7 of the 13 managerial tasks, as shown in Table 3. Notably, however, the managerial tasks that hold the highest practical value for our respondents are absent from Holzmann's expected list. Conversely, tasks that Holzmann et al. anticipated, such as the creation of project schedules, analyzing the implications of missed deadlines, and updating project progress and schedules, are notable omissions from our list. This could be attributed to the substantial contextual information these tasks require; it is possible that our participants had not yet explored ChatGPT's capabilities deeply enough to realize this level of functionality.
Bera et al. describe ChatGPT as a virtual team member that could perform the tasks of an Agile coach or Scrum master [3]. While our examination demonstrates ChatGPT's applicability to various tasks traditionally carried out by humans, its role currently remains at the assistance level. Essentially, all of the tasks (see Tables 2 and 3), such as managing sprint backlogs, online meetings [16], and Agile retrospectives [15], can be assisted by ChatGPT acting as a virtual assistant rather than as a virtual team member. It aids in completing discrete tasks but does not automate entire workflows. One reason might be that the capacity of ChatGPT, at least in its currently available version, is not yet sufficient for complex decision-making, deep context understanding, and chains of interconnected tasks. Another reason can be the barriers to adopting ChatGPT for sensitive project management tasks, especially where confidentiality or legal concerns arise in the project.
6 Conclusions
This research unveils the current practices and perceptions regarding the adoption of
ChatGPT in Agile software development. Our survey of 73 software professionals
in the Nordic region highlights a range of tasks that can be enhanced with
ChatGPT and assesses the perceived value associated with these tasks. The study
provides insights into how ChatGPT can improve various aspects of Agile software
development based on a value-based approach. Technical tasks such as generating code and summarizing and explaining code, together with managerial tasks related to communication and documentation such as technical content, emails, meeting agendas, and note-taking, could lead to a better, faster, and more efficient software development and delivery process. Moreover, supporting human-factor and team-dynamics tasks such as onboarding team members and communication strategies could lead to better collaboration, communication, and decision-making. ChatGPT could potentially reduce developers' workload by taking on tasks like user story generation, correction, and refinement, automating commit workflows, and helping team leads provide guidance on software design and performance optimization.
However, we are limited in empirically supporting this claim. Future research can
further evaluate our task list using objective measurements to gauge the actual value
derived from adopting ChatGPT. Longitudinal studies in established settings are
needed to gain a deeper understanding of the sustainable benefits, challenges, and
lessons learned. These insights will be instrumental in assessing long-term trends
and the enduring impact of ChatGPT in Agile software development.
References
1. Ajzen, I.: The theory of planned behavior. Organiz. Behav. Human Decis. Process. 50(2), 179–
211 (1991)
2. Alhashmi, S.F.S., Alshurideh, M., Al Kurdi, B., Salloum, S.A.: A systematic review of
the factors affecting the artificial intelligence implementation in the health care sector. In:
Hassanien, A.E., Azar, A.T., Gaber, T., Oliva, D., Tolba, F.M. (eds.) Proceedings of the
International Conference on Artificial Intelligence and Computer Vision (AICV2020), pp. 37–
49. Advances in Intelligent Systems and Computing, Springer International Publishing
3. Bera, P., Wautelet, Y., Poels, G.: On the use of ChatGPT to support agile software development. In: Short Paper Proceedings of the 2nd International Workshop on Agile Methods for Information Systems Engineering (Agil-ISE 2023), co-located with the 35th International Conference on Advanced Information Systems Engineering (CAiSE 2023), vol. 3414, pp. 1–9. CEUR. ISSN: 1613-0073
4. Berntzen, M., Hoda, R., Moe, N.B., Stray, V.: A taxonomy of inter-team coordination mechanisms in large-scale agile. IEEE Trans. Softw. Eng. 49, 699–718
5. Cico, O., Souza, R., Jaccheri, L., Nguyen Duc, A., Machado, I.: Startups transitioning from
early to growth phase - a pilot study of technical debt perception. In: Klotins, E., Wnuk, K.
(eds.) Software Business. Lecture Notes in Business Information Processing, pp. 102–117.
Springer International Publishing, Cham
6. Dam, H.K., Tran, T., Grundy, J., Ghose, A., Kamei, Y.: Towards effective AI-powered
agile project management. In: 2019 IEEE/ACM 41st International Conference on Software
Engineering: New Ideas and Emerging Results (ICSE-NIER), pp. 41–44
7. Daun, M., Brings, J.: How ChatGPT will change software engineering education. In: Proceed-
ings of the 2023 Conference on Innovation and Technology in Computer Science Education V.
1, pp. 110–116. ITiCSE. Association for Computing Machinery (2023)
8. Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly 13(3), 319–340 (1989). https://doi.org/10.2307/249008
9. Dingsoeyr, T., Falessi, D., Power, K.: Agile development at scale: the next frontier. IEEE Softw. 36(2), 30–38 (2019)
10. Gao, L., Bai, X.: A unified perspective on the factors influencing consumer acceptance of internet of things technology. Asia Pacific J. Marketing Logist. 26(2), 211–231 (2014)
11. Hoda, R., Murugesan, L.K.: Multi-level agile project management challenges: a self-organizing
team perspective. J. Syst. Softw. 117, 245–257 (2016)
12. Hoda, R., Dam, H., Tantithamthavorn, C., Thongtanunam, P., Storey, M.A.: Augmented agile: human-centered AI-assisted software management. IEEE Softw. 40(4), 106–109 (2023)
13. Holzmann, V., Zitter, D., Peshkess, S.: The expectations of project managers from artificial
intelligence: a delphi study. Project Manag. J. 53(5), 438–455 (2022)
14. Jabangwe, R., Nguyen-Duc, A.: SIoT framework: Towards an approach for early identification
of security requirements for internet-of-things applications. e-Informatica Softw. Eng. J. 14,
77–95 (2020)
15. Khanna, D., Wang, X.: Are your online agile retrospectives psychologically safe? The usage of
online tools. In: International Conference on Agile Software Development, pp. 35–51. Springer
(2022)
16. Khanna, D., Nguyen-Duc, A., Wang, X.: From MVPs to pivots: a hypothesis-driven journey of two software startups. In: Software Business: 9th International Conference, ICSOB 2018, Tallinn, Estonia, June 11–12, 2018, Proceedings 9, pp. 172–186. Springer (2018)
17. Kim, H.W., Chan, H.C., Gupta, S.: Value-based adoption of mobile internet: an empirical investigation. Decision Support Syst. 43(1), 111–126 (2007)
18. Lin, T.C., Wu, S., Hsu, J.S.C., Chou, Y.C.: The integration of value-based adoption and
expectation–confirmation models: an example of IPTV continuance intention. Decision Sup-
port Syst. 54(1), 63–75
19. Min, S., So, K. K. F., Jeong, M.: Consumer adoption of the uber mobile application: Insights
from diffusion of innovation theory and technology acceptance model. In: Future of Tourism
Marketing. Routledge, p. 14 (2021)
20. Nguyen Duc, A., Khalid, K., Lønnestad, T., Bajwa Shahid, S., Wang, X., Abrahamsson, P.:
How do startups develop internet-of-things systems - a multiple exploratory case study. In:
2019 IEEE/ACM International Conference on Software and System Processes (ICSSP), pp.
74–83
21. Prifti, V.: Optimizing project management using artificial intelligence. Eur. J. Formal Sci. Eng.
5(1), 29–37 (2022)
22. Sohn, K., Kwon, O.: Technology acceptance theories and factors influencing artificial
intelligence-based intelligent products. Telemat. Informat. 47, 101324 (2020)
23. Sravanthi, J., Sobti, R., Semwal, A., Shravan, M., Al-Hilali, A.A., Bader Alazzam, M.: AI-
assisted resource allocation in project management. In: 2023 3rd International Conference on
Advance Computing and Innovative Technologies in Engineering (ICACITE), pp. 70–74
24. Sukkar, A.A., Hasan, H.: Toward a model for the acceptance of internet banking in developing
countries. Inf. Technol. Develop. 11(4), 381–398 (2005). https://onlinelibrary.wiley.com/doi/
pdf/10.1002/itdj.20026
25. Venkatesh, V., Morris, M.G., Davis, G.B., Davis, F.D.: User acceptance of information technology: toward a unified view. MIS Quarterly 27(3), 425–478 (2003). https://doi.org/10.2307/30036540
Early Results from a Study of GenAI
Adoption in a Large Brazilian Company:
The Case of Globo
Abstract Given the nascent and evolving nature of the latest generative AI tools, there is little advice on how best to adopt generative AI tools within software teams or on what benefits and concerns to expect. In this chapter, we share the experiences of Globo, a large media group that has recently begun to adopt OpenAI ChatGPT and GitHub Copilot for software development activities. We describe Globo's adoption approach and provide early insights into potential benefits and concerns in the form of eight initial lessons drawn from diaries kept by developers as well as semi-structured interviews with them.
learned are that the use of generative AI tools drives the adoption of additional
developer tools and that developers intentionally use ChatGPT and Copilot in a
complementary manner. We hope that sharing these practical experiences will help
other software teams in successfully adopting generative AI tools.
1 Introduction
Software teams have always adapted their ways of working in response to techno-
logical and social advances. From the adoption of agile and lean ways of working,
through embracing DevOps practices, to utilizing cloud infrastructure, modern-day
engineering teams have accelerated their deliveries to where development cycle
times are measured in hours and days with multiple deployments to production
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Nguyen-Duc et al. (eds.), Generative AI for Effective Software Development,
https://doi.org/10.1007/978-3-031-55642-5_13
per day [12]. Generative AI, as evidenced through tools like OpenAI ChatGPT1
and GitHub Copilot,2 is the latest disruption that is expected to have a profound
impact on software teams [6]. An often touted key benefit is an increase in developer
productivity due to the ability of generative AI tools to speed up development
activities, such as coding [14] and testing [9], by automatically generating code.
Leveraging this automation requires developers to change the way they work; they
spend less time writing code and more time reviewing and understanding code [5].
Yet adopting new ways of working and new tools is challenging for individuals,
teams, and companies. It can be time-consuming and expensive to retrain teams to
use new tools and to adapt their practices, especially when the tools are immature
and still evolving with little practical guidance, all while ensuring that code is still
shipped and deadlines are met. Companies thus need to be clear on the benefits
before embarking on a potentially lengthy and costly journey of adopting generative
AI tools [1].
To help others who are considering embarking on such a journey, we provide this
preliminary experience report of Globo, a media group that has recently adopted
both OpenAI ChatGPT and GitHub Copilot for software development activities.
We share the approach for rolling out the tools and provide early insights into
the potential benefits and issues of using generative AI. We defined the following
research question: what are the main lessons learned by adopting generative AI for
software development at Globo?
The remainder of this chapter is organized as follows: Sect. 2 presents back-
ground information, while Sect. 3 introduces the research design. Section 4 presents
the eight lessons learned, while Sect. 5 concludes the chapter.
2 Background Information
This section provides a brief overview of the capabilities, benefits, and limitations
of GenAI tools used by software developers, with a focus on the two adopted by
Globo: OpenAI ChatGPT and GitHub Copilot.
GenAI is artificial intelligence capable of generating text, images, or other media,
using generative models [19]. These models learn patterns from input data to
generate data with similar characteristics. Recent advances in neural networks have
led to a number of GenAI tools. Some are general-purpose conversational agents
(e.g., OpenAI ChatGPT, Google Bard3 ), while others are geared toward a specific
audience such as developers (GitHub Copilot). In both cases, the models have been
trained on input data scoured from many sources on the Internet, including some that
1 https://openai.com/chatgpt.
2 https://github.com/features/copilot.
3 https://bard.google.com/.
are software engineering relevant (e.g., code from open-source projects maintained
on GitHub).
Within software development, ChatGPT has been shown to assist developers
in a variety of common software development activities, including authoring
requirements [27], generating architecture [2] and design models [8], fixing defects
[24], generating code [20], and seeking help [13].
Research on Copilot notes that it provides developers with a perceived productivity boost, enabling them to work faster on repetitive tasks and to focus on more satisfying work [14]. Code generated by Copilot is of low cyclomatic
complexity [21] and is of the same complexity and as readable as human-generated
code [3]. When given coding problems, Copilot has been shown to generate valid
(e.g., compilable, interpretable) code although less than half of the solutions were
correct [28]. As well as coding, Copilot can assist with authoring unit tests and
identifying defective code [5].
However, it is not all positive news when it comes to using GenAI for software
activities. Drawbacks of using GenAI include hallucination, such as when ChatGPT generates code that references libraries or packages that do not exist. This can
provide an exploitation entry point for bad actors [16], leading to security risks
[25]. Also, code generated by ChatGPT and Copilot can contain vulnerabilities
[15]. The code may contain bugs that may not be immediately apparent on initial
inspection [18]. In addition to these security and technical issues, there are wider
societal concerns such as potential copyright issues [17], the potential for bias [7],
and emerging legal issues [23].
Moreover, developers need to change their working practices to derive benefits
from GenAI tools. They need to be cognizant of recommended prompt engineering
techniques [26], that is, how to best structure and organize their prompts to gain
helpful answers. Also, to overcome some of the limitations of GenAI tools such as
generating defective or insecure code, engineers need to review and understand the
code [20].
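As a simple illustration of what such prompt structuring can look like, the sketch below assembles a prompt from a persona, the relevant context, the concrete task, and the expected output format. The template and the helper function are hypothetical illustrations, not patterns prescribed by [26] or used at Globo.

```python
# Hypothetical prompt template illustrating one common way to structure a prompt.
def build_review_prompt(language: str, code: str) -> str:
    """Assemble a structured prompt: persona, context, task, and output format."""
    return (
        f"You are an experienced {language} developer doing a code review.\n"          # persona
        "Context: the snippet below is part of a REST API handler.\n"                  # context
        "Task: identify potential bugs and security issues in the snippet.\n"          # task
        "Output format: a numbered list, one finding per line, most severe first.\n"   # output format
        f"Code:\n{code}\n"
    )

snippet = "def get_user(id):\n    return db.query('SELECT * FROM users WHERE id = ' + id)"
print(build_review_prompt("Python", snippet))
```

Spelling out the role, context, task, and output format in this way tends to make the answer easier to check and to reuse, which is the kind of discipline the prompt-engineering literature recommends.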
3 Research Design
The setting of the study is the Digital Platforms structure within Globo,4 a large
media group based in Latin America. This structure is part of the Digital Hub, which
is Globo’s Digital Technology division, and has three areas: GloboID (identity
provider), Webmedia, and Publishing platform. These areas have approximately 250
people who provide solutions for digital products.
The study particularly concerned a pilot of introducing ChatGPT and Copilot
to six teams with over forty employees total in various roles (e.g., developers,
UX, DevOps, product owners, managers). A typical team consists of about five
4 https://grupoglobo.globo.com/.
A multidisciplinary team was assembled to execute the project. This team consisted of a project manager, representatives from the company's legal department, and researchers from two universities, PUCRS and UCI (one professor and one PhD student from each). The legal representatives were included because the company had several legal concerns about the use of generative AI, mostly regarding data leakage, code ownership, and the licenses that may govern the code suggested by the tools.
The academic researchers' involvement in the study primarily consisted of acting as unpaid consultants to the project team. They helped design the study, including the strategies for data collection and analysis, and recommended metrics and types of qualitative data that would be helpful in assessing the potential benefits of using generative AI tools in software activities. The project
consisted of three phases planned to be completed in five months from August 2023:
Preparatory (1 month), Pilot (1 month), and Rollout (3 months). At the time of
writing, the Rollout phase is ongoing.
The first phase was necessary to undertake any preparatory activities required
by the pilot and rollout, such as license procurement and agreement on the specific
metrics to be used to understand the outcomes of the study. It included meetings
with Digital Platforms management to define the study protocol, decide which teams
would take part in the study, and choose which of the six teams would do the pilot
before all teams participated in the rollout.
The aim of the pilot was to obtain feedback on the use of generative AI tools
from a single team, so as to be able to refine and align the study protocol with actual
experiences. The outcomes of this phase were presented to both the participants and
the management of the company for a final round of feedback.
Globo’s management selected the pilot team based on the characteristics and
availability of the team. The pilot team subsequently participated in a kickoff event
to familiarize its members with the project and establish the essential infrastructure.
Team members were instructed to use the tools freely but record experiences and
examples of use in a spreadsheet (called a “diary” in the remainder of this chapter).
Effectively, the team members were encouraged to experiment with the tools leading
to experiential learning.
Participants were encouraged to use the tools freely in any software engineering
activity they felt appropriate. Following the lessons learned in the pilot stage, the
third phase of the project has now been kicked off with six participating teams. The
same instructions were provided to these teams. This third phase is in progress.
Various kinds of qualitative data were captured both during the pilot and during the first few weeks of the broader rollout. Data collected during the pilot consisted of:
• At the start of the pilot, an initial survey5 (pre-survey) captured the profile of the
participants and their expectations and concerns regarding the use of the tools.
• Daily, a free-style document for team members to share their day-to-day experi-
ences with the tools (the “diary”). In this document, participants were encouraged
to provide descriptions of tasks and situations in which they utilized the tools.
• Weekly, a 30-minute meeting took place between the researchers and the pilot
team to collect participants’ experiences of the tools that week. This included
discussions on the tasks they were trying to accomplish with the AI tools and
whether the tools were helpful.
• At the end of the pilot, a second survey6 (post-survey) captured participants’
perceptions of the tools, including their impact on the development process.
During the rollout phase, the same data was captured as in the pilot phase.
However, some adjustments were made in the diary document to collect which AI
tool was used for which tasks. We also collected how participants felt about their
GenAI experience on that particular day, using a three-point scale: happy, neutral,
and sad.
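A minimal sketch of what such a rollout-phase diary entry could look like is shown below; the field names and the example entry are hypothetical illustrations, not the actual spreadsheet columns used at Globo.

```python
# Hypothetical structure of a rollout-phase diary entry.
from dataclasses import dataclass
from enum import Enum

class Mood(Enum):
    HAPPY = "happy"
    NEUTRAL = "neutral"
    SAD = "sad"

@dataclass
class DiaryEntry:
    date: str          # e.g., "2023-10-02"
    team: str          # e.g., "Team 3"
    tool: str          # "ChatGPT" or "Copilot"
    task: str          # what the tool was used for
    experience: str    # free-text description of how it went
    mood: Mood         # how the participant felt about the GenAI experience that day

entry = DiaryEntry("2023-10-02", "Team 3", "Copilot",
                   "writing unit tests for a payment component",
                   "suggestions covered most edge cases after a descriptive test name",
                   Mood.HAPPY)
print(entry)
```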
All the members of Team 1 participated in the pilot phase. In the current stage,
the initial survey was answered by 47 people distributed among the 6 teams, in the
roles shown in Fig. 1.
Due to the increase from one pilot team to six participating teams, in-person
debriefs were held every 15 days on a team basis for 30 minutes. We used the
following questions to guide the discussion:
• Please share the positive aspects and difficulties of interacting with ChatGPT and
Copilot.
• Do you feel more productive?
• Did you identify any security threats or data leaks when you used the tools?
• Do you know and use any prompt engineering techniques?
• How has using Copilot/ChatGPT increased or reduced creativity and innovation
in your coding, design, and other tasks?
• Do you think you are saving time?
• If so, what have you done with the time saved?
• How has using Copilot/ChatGPT affected your general attitude toward work, as
well as your personal well-being at work?
To date, we have interviewed each team two times. Table 1 indicates the type of
application the teams develop. Table 2 contains other relevant information about the
teams, the technologies they use, and software development processes and practices
adopted.
5 https://forms.gle/pPiF3vCAtahVrjcH9.
6 https://forms.gle/wwYiCKAg6V314KZKA.
Using the diaries and data collected during the meetings, the qualitative data analysis
process was conducted in two distinct phases by the researchers, comprising an
initial screening phase and a subsequent detailed analysis phase. In both phases,
one researcher examined the data to identify insights and look for themes that were
subsequently discussed and reviewed with other members of the research team. To
help guide the analysis, five dimensions were considered:
Developer Concerns We sought to identify the concerns and reservations
expressed by developers concerning the adoption of generative AI tools in their
daily workflow.
Positive and Negative Aspects A central aspect of our analysis was the exploration
of both positive and negative facets associated with the integration of generative
AI tools in software development, allowing for a balanced understanding of their
impact.
Tasks Enhanced We explored the specific types of tasks and software develop-
ment activities where generative AI tools proved to be particularly advantageous,
discerning their areas of usefulness.
Usage Across Roles An integral element of the analysis was an examination of how developers and other team members across different roles were leveraging generative AI tools, shedding light on variations in usage patterns.
Productivity Enhancement Lastly, we sought to evaluate the extent to which these tools contributed to overall productivity gains within the software development process.
This analysis resulted in the identification of the initial eight lessons described
later.
Our study is subject to some threats and limitations. First, the absence of quantitative
data, such as bug counts or delivery times, restricts the ability to provide concrete,
measurable insights into the impact of these tools on development processes. That
is, our study shares perceptions from developers that ultimately may not be true.
Complementary or quantitative studies are needed.
Second, the study’s focus on a single large company may limit the general-
izability of its findings. Lastly, qualitative data collection, particularly through
diaries and interviews, introduces the possibility of response bias and subjectivity.
While we sought to minimize this risk by comparing developers’ perspectives with
existing literature and interviewing a number of different teams to see if there was
commonality in the perceptions across teams, such risks will always persist.
4 Eight Lessons
This section presents lessons learned from our ongoing study, until now, based on
the teams described in Sect. 3.2. By analyzing the diaries and checkpoint meetings,
we have identified eight preliminary lessons about the adoption and usage of GenAI
by the developers. The lessons cover a variety of software development activities.
Some of the lessons apply to one of the tools only (e.g., Context Matters), whereas
others apply to both tools (e.g., Improving Unit Testing). These eight lessons are
shown in Fig. 2 and categorized by the applicable tool (ChatGPT and/or Copilot).
At Globo, the influence of GenAI went beyond its primary functions. The devel-
opment teams increasingly recognized that GenAI indirectly encouraged them
to explore tools and practices that enhance code quality and overall project
success. Specifically, the Globo developers started using Sonar7 not just as a post-
development auditing tool but especially to collect evidence of the efficiency of
GenAI (a requirement of the project). By analyzing the code generated or written
with the assistance of GenAI, Sonar provided valuable insights into code quality,
identified potential vulnerabilities, and highlighted areas for improvement. The
7 https://www.sonarsource.com/.
developers also used ChatGPT to help using Sonar. A developer mentioned, “I used
ChatGPT to ask a question about how to use Sonar on a monorepo. The answer
helped me confirm what I already thought was the way to go (Team 1, Dev 1).” The
developers noted that the integration of Sonar into the development process enabled
them to maintain a continuous focus on code quality.
Based on the conversations with the teams, we found that Copilot's performance varied significantly depending on the context of the application it is used in. Developers working on legacy code found Copilot was not helpful, while others found it more beneficial in new projects and codebases.
The integration of Copilot into projects featuring legacy code and older tech-
nology often presented developers with a unique set of challenges. Copilot’s
understanding of the context in such cases was less accurate due to outdated
conventions, coding practices, and technology stacks. It sometimes struggled to
provide relevant suggestions and did not always fully comprehend the intricacies
of legacy systems. Adapting Copilot to understand and work effectively with legacy
code often required manual adjustments and a deeper contextual understanding by
developers. As one developer working on legacy code noted, “Copilot can’t help.
It doesn’t seem to understand code that isn’t well-structured (Team 1, Dev 4).”
Interestingly, it appears that ChatGPT copes better with legacy code as the same
developer describes, “ChatGPT gives pertinent answers to specific questions, even
about legacy technologies (Team 1, Dev 4).”
According to the Globo developers, Copilot shined in the realm of new projects
and fresh codebases. Its understanding of modern coding conventions, current
technology stacks, and best practices was better. As one developer noted, “Copilot
is competent to recognize context within the file (Team 5, Dev 1),” and another noted
its support for modern-day coding practices: “Copilot looks smart for testing well-
structured code. Good for isolated components (Team 2, Dev 1).”
Overall, this is an important point for Globo to take forward and explore, given
that some of its codebase contains legacy code. A key outcome, then, is that it could
be beneficial to explore how to make GenAI tools more supportive of legacy code
perhaps by building an internal language model based on its own code base. How
this then balances with knowledge learned from external sources will remain to be
seen.
While Copilot really helped the teams, it relied on developers crafting high-quality
code to provide better suggestions. The responsibility for the quality of code
suggestions was a shared endeavor that hinged on the developer’s expertise. A Globo
developer said that “Copilot depends on the developer. For example, if you start
writing a method with a proper name, [it] makes better suggestions. Sometimes it
can’t even do it (Team 2, Dev 1).”
Developers had to be mindful of the code they wrote, ensuring it was readable and maintainable and followed established coding standards. Copilot, in turn,
complemented the developer’s work by offering suggestions that aligned with the
code’s context. Copilot leveraged its understanding of code patterns, best practices,
and programming languages to offer suggestions. However, developers were the
ones who set the foundation by writing code that was well structured, adhered to
best practices, and was clear in its intent. To illustrate this, another developer said,
“Creating a function with a descriptive name will help the Copilot answer (Team 3,
Dev 2).”
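The sketch below illustrates this practice with a hypothetical example: a vaguely named function gives an assistant little context, whereas a descriptive, typed signature and docstring spell out the intent that a tool like Copilot can complete against.

```python
# Hypothetical illustration of the practice described by the Globo developers.

# Vague starting point: the intent is unclear, so suggestions tend to be generic.
def process(data):
    ...

# Descriptive starting point: the name, types, and docstring spell out the intent,
# which is the context an assistant like Copilot uses when suggesting a body.
def filter_overdue_invoices(invoices: list[dict], today: str) -> list[dict]:
    """Return invoices whose 'due_date' (ISO format, e.g. '2023-08-15') is before today."""
    return [inv for inv in invoices if inv["due_date"] < today]

invoices = [{"id": 1, "due_date": "2023-07-30"}, {"id": 2, "due_date": "2023-09-01"}]
print(filter_overdue_invoices(invoices, today="2023-08-15"))  # only invoice 1 is overdue
```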
Some Globo developers turned to ChatGPT as a viable and favored search platform
compared to Google, dedicated software engineering Q&A platforms such as Stack
Overflow,8 or even documentation. They did this because ChatGPT allowed them
to pose their questions in plain language and receive immediate, context-aware
responses. The Globo developers felt that this type of interaction significantly
reduced the time they spent searching and filtering through search results, making it
an attractive choice for developers seeking swift and precise answers. Specifically,
the developers noted that ChatGPT helped speed up their work because “ChatGPT helps with technical questions faster than Google” (Team 5, Dev 2) and “AI delivers a more practical summary” (Team 3, Dev 2) in comparison to traditional search engines. Moreover, the developers felt the answers were more assertive than Google searches, leading to accelerated work.
While it is clear that GenAI has found its place in coding and testing, there is a
widespread lack of awareness regarding its potential in other software development
activities, such as software requirements or software design. Most of the Globo
reports to date concerned how developers explored several ways of generating code,
obtaining debugging support, and expediting the testing process.
However, GenAI has the capability to assist beyond coding, such as in software
requirements, from generating user stories and use cases to aiding in requirement
analysis and documentation. One of the teams started exploring this, using it to
streamline the process by seeking recommendations from ChatGPT and ensuring
that requirements are well-defined. One team member mentioned that “ChatGPT
provides good support for writing documentation. I use it a lot for documentation
(Team 4, Dev 1).” (In this case, they mean requirements documentation, not code
comments.)
GenAI can also be a good resource in design activities. It has the capability to
assist in creating wireframes, generating design documents, and even automating
certain design tasks. By providing design suggestions, aiding in prototype creation,
and facilitating iterative design, GenAI can optimize the entire design process.
At Globo, some aspects of this were explored. For example, one team member
mentioned that “ChatGPT helped us with a task that required us to build a flow
diagram. It helped to elucidate complex flows in blocks of code (Team 4, Dev 2).”
However, our study so far showed that the use of GenAI beyond coding is
underexplored at Globo. The teams have not yet explored GenAI for requirements,
design, and even other activities such as software architecture. The examples
mentioned above are the only instances of its use for such tasks so far. At the same
time, recent literature suggests that this is common at other places too [22].
8 https://stackoverflow.com/.
4.9 Discussion
Although the adoption of GenAI is ongoing at Globo, these first eight lessons
provide some interesting perspectives for further consideration. One noteworthy
perspective is that when examining the lessons holistically, it seems that ChatGPT
and Copilot are complements, rather than alternatives. Based on the analysis, they
are complements for two reasons: (i) they support somewhat different use cases,
and (ii) where they support the same use case, they do so in different ways by
playing to each other’s strengths. For example, Globo developers are using ChatGPT
as a search companion (Lesson 6), because of its natural language user interface
and understanding of context. For AI-enabled pair programming (Lesson 2), both
ChatGPT and Copilot assisted the developer. A developer first would engage
with ChatGPT to explore the problem and solution space and then complete the
implementation using Copilot’s intelligent autocomplete.
While Globo developers reflected on how GenAI can help them in effective ways
to save time (e.g., improving testing (Lesson 3)), they also raised concerns such as
the ability of GenAI to support legacy codebases (Lesson 4) and the potential for
privacy risks (Lesson 8). These are crucial to address if the potential of GenAI is to
be fully realized. As stated earlier, how to do so remains an open question.
The adoption of GenAI tools is also spurring Globo to adopt additional tools
such as Sonar for measuring code quality and identifying potential vulnerabilities
(Lesson 1). This adoption could potentially assist with some of the potential privacy
and security risks noted in Lesson 8, though it is not a complete solution for that.
Again, the Globo teams will need to consider the limitations and what can be done
further.
Finally, it is somewhat surprising that GenAI has been little explored by Globo
developers for non-coding activities such as requirements engineering or design
(Lesson 7). It is unclear why this is, perhaps due to a lack of awareness of the
potential of GenAI beyond merely code and test generation. Indeed, the broader
research literature does not talk about this much [11], nor does the popular literature
(e.g., magazines, blogs). At the same time, because of its generative capabilities,
one would expect potential benefits precisely for activities such as requirements and
design, where creative and broad exploration can matter. Globo's management plans to look at this potential in more detail.
5 Conclusion
Appendix: Surveys
Pre Survey
This qualitative research begins the execution of the study on the use of generative
AI tools in software development at Globo. The objective of the questionnaire is
to get to know the members of the teams participating in the study. Your feedback
and participation are very important to understanding the characteristics, skills, and
perspectives of each participant. The data will only be used for research purposes.
1. Team
2. Age
(a) 18–25 years old
(b) 26–35 years old
(c) 36–45 years old
(d) 46–55 years old
(e) Above 56 years old
3. Gender
4. Experience in software development
(a) 1–3 years
(b) 4–7 years
(c) 8+ years
5. Certifications in software development
6. Programming languages you’ve worked with
(a) Python
(b) Java
(c) JavaScript
(d) C++
(e) C#
(f) Ruby
(g) Go
(h) Others
7. How long have you worked at the company?
(a) 1–3 years
(b) 4–7 years
(c) 8+ years
8. What is your role on the team?
(a) Coordinator
(b) Developer
(c) DevOps
Post-survey
This qualitative research finalizes the execution of the study on the use of generative
AI tools in software development at Globo. The objective of the questionnaire
is to understand the perceptions of the experiment participants, after a period of
using generative AI tools. Your feedback and participation are very important to
understanding the characteristics, skills, and perspectives of each participant. The
data will only be used for research purposes.
1. Team
2. Do you find generative AI tools useful for software development? Why?
3. What benefits (not limited to code) have you seen from using generative AI
tools?
4. What negative aspects (not limited to code) have been observed in the use of
generative AI tools?
5. Do you believe that generative AI tools have facilitated or accelerated the
software development process? Why?
6. Have you noticed any differences in code quality when using generative AI
tools? In what aspects?
7. Regarding code written by you without the support of the AI tool, what differences do you notice?
8. Have generative AI tools helped you avoid common mistakes or identify
problems in the code? In what way?
9. Do you think generative AI tools have promoted greater knowledge sharing
among team members?
10. Describe difficulties or limitations you have encountered when using generative
AI tools.
11. Based on your experience during the experiment, what do you think could be
improved in generative AI tools?
References
10. Hannay, J.E., Dybå, T., Arisholm, E., Sjøberg, D.I.: The effectiveness of pair programming: a meta-analysis. Inf. Softw. Technol. 51(7), 1110–1122 (2009)
11. Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., Wang,
H.: Large language models for software engineering: A systematic literature review (2024)
(arXiv:2308.10620). arXiv. https://doi.org/10.48550/arXiv.2308.10620
12. Humble, J., Kim, G.: Accelerate: The science of lean software and devops: Building and scaling
high performing technology organizations. IT Revolution (2018)
13. Kabir, S., Udo-Imeh, D.N., Kou, B., Zhang, T.: Who answers it better? An in-depth analysis of
ChatGPT and stack overflow answers to Software Engineering Questions (2023). https://doi.
org/10.48550/arXiv.2308.02312. http://arxiv.org/abs/2308.02312
14. Kalliamvakou, E.: Research: Quantifying GitHub Copilot’s impact on developer productivity
and happiness (2022). https://github.blog/2022-09-07-research-quantifying-github-copilots-
impact-on-developer-productivity-and-happiness/
15. Khoury, R., Avila, A.R., Brunelle, J., Camara, B.M.: How secure is code generated by Chat-
GPT? (2023). https://doi.org/10.48550/arXiv.2304.09655. http://arxiv.org/abs/2304.09655
16. Lanyado, B.: Can you trust ChatGPT’s package recommendations? (2023). https://vulcan.io/
blog/ai-hallucinations-package-risk/
17. Lucchi, N.: ChatGPT: a case study on copyright challenges for generative artificial intelligence
systems. Eur. J. Risk Regul., 1–23 (2023). https://doi.org/10.1017/err.2023.59
18. Moradi Dakhel, A., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M.C., Jiang,
Z.M.J.: GitHub copilot AI pair programmer: asset or Liability? J. Syst. Softw. 203,
111734 (2023). https://doi.org/10.1016/j.jss.2023.111734. https://www.sciencedirect.com/
science/article/pii/S0164121223001292
19. Murgia, M.: Generative AI exists because of the transformer (2023). https://ig.ft.com/
generative-ai/
20. Nascimento, N., Alencar, P., Cowan, D.: Comparing Software Developers with ChatGPT:
An Empirical Investigation (2023). arXiv preprint arXiv.2305.11837. https://doi.org/10.48550/
arXiv.2305.11837. http://arxiv.org/abs/2305.11837
21. Nguyen, N., Nadi, S.: An empirical evaluation of GitHub copilot’s code suggestions. In:
Proceedings of the 19th International Conference on Mining Software Repositories, MSR
’22, pp. 1–5. Association for Computing Machinery (2022). https://doi.org/10.1145/3524842.
3528470. https://dl.acm.org/doi/10.1145/3524842.3528470
22. Ozkaya, I.: Can architecture knowledge guide software development with generative ai? IEEE
Softw. 40(05), 4–8 (2023). https://doi.org/10.1109/MS.2023.3306641
23. Ray, S.: Samsung Bans ChatGPT Among Employees After Sensitive Code Leak (2023).
https://www.forbes.com/sites/siladityaray/2023/05/02/samsung-bans-chatgpt-and-other-
chatbots-for-employees-after-sensitive-code-leak/
24. Sobania, D., Briesch, M., Hanna, C., Petke, J.: An Analysis of the Automatic Bug Fixing
Performance of ChatGPT (2023). https://doi.org/10.48550/arXiv.2301.08653. http://arxiv.org/
abs/2301.08653
25. Tal, L.: Can machines dream of secure code? From AI hallucinations to software vulnerabilities
(2023). https://snyk.io/blog/ai-hallucinations/
26. White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith,
J., Schmidt, D.C.: A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT
(2023). https://doi.org/10.48550/arXiv.2302.11382. http://arxiv.org/abs/2302.11382
27. White, J., Hays, S., Fu, Q., Spencer-Smith, J., Schmidt, D.C.: ChatGPT Prompt Patterns for
Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design (2023).
https://doi.org/10.48550/arXiv.2303.07839. http://arxiv.org/abs/2303.07839
28. Yetistiren, B., Ozsoy, I., Tuzun, E.: Assessing the quality of GitHub copilot’s code gener-
ation. In: Proceedings of the 18th International Conference on Predictive Models and Data
Analytics in Software Engineering. PROMISE 2022. pp. 62–71. Association for Computing
Machinery (2022). https://doi.org/10.1145/3558489.3559072. https://dl.acm.org/doi/10.1145/
3558489.3559072
Part V
Future Directions and Education
Generating Explanations for AI-Powered
Delay Prediction in Software Projects
Abstract A project failure can be attributed to complex negative factors that cause project progress to deviate from the original schedule, and one of the root causes can be delay. Hence, the early detection of delay signs can be a critical component of project success. One approach that contributes to solving this problem is the development of prediction models, and machine learning methods are promising given their recent success in other areas. We therefore introduce a novel AI-based approach using an explainable graph neural network that elucidates the causes of a delay without compromising prediction performance. Three experimental results demonstrate that (1) our model can predict delays with 4% higher accuracy on average, (2) our model returns stable results, providing similar prediction performance with similar explanations when the same prediction tasks are given, and (3) the generated explanations can provide actual reasons for the delay prediction, given that an optimal threshold is used. These points can make delay prediction systems more supportive for users, reducing project failures due to poor time control.
1 Introduction
2 Motivating Example
1 https://spring.io/projects
Fig. 1 Dependencies of SGF-314. The square boxes are attributes corresponding to each task. The
goal is to correctly predict the delay probability of SGF-314 with an explanation. The red-colored
nodes, edges, and attributes are highlighted as explanatory components for the prediction
Task dependencies form a graph of tasks where each node represents a task
and an edge represents a dependency between two tasks. Task nodes, edges, and
task attributes are considered as components of the graph. Predicting if a task
will be delayed or not can be seen as a node classification problem. We need to
predict a label (delayed or non-delayed) of an unlabelled node that is connected to
some labelled nodes. Graph neural networks (GNNs) have generated breakthroughs
in node classification across different domains. Thus, there is an opportunity to
leverage graph neural networks here, which is the first focus of this chapter. We
investigate if the use of a graph neural network would improve the predictive
performance of delay prediction.
The second focus of our work is proposing a method that can generate explana-
tions. The right-hand side of Fig. 1 illustrates an example of such an explanation.
Task SGF-314 is predicted to be delayed, and an explanation is provided in terms of
a small subset of features and dependent tasks, which are likely to contribute to the
delay. Those are the red-colored nodes, edges, and attributes in Fig. 1.
Automated support is one approach to delay prediction [4]. Machine learning-based support can help project teams prevent a delay before it occurs so that the project can be completed within the proposed time frame. Several machine learning-based methods [5, 7, 8] can forecast a delay in a project. However, these methods do not make use of dependencies between tasks. A task delay cannot be predicted precisely without task dependencies, since a delay in one task may impact other tasks; the impact of task dependencies has been analyzed and emphasized in [25]. We believe that using task dependencies and modelling them as a graph can improve the accuracy of delay prediction. This leads to our first research question:
RQ1. Does our graph-based method improve the delay prediction accuracy in
software projects?
Most of the work on explainable artificial intelligence (e.g., [3, 14, 19, 28, 31]) has
focused on four main evaluation criteria for explainability.
Accuracy evaluates the degree to which adding explainability degrades the performance of the predictions themselves. The addition of an explanation algorithm should not hurt prediction accuracy as a trade-off. Because it is simple to measure, accuracy is the most commonly used evaluation criterion, and a range of techniques (e.g., [17, 21, 29, 35, 40]) use it to validate the robustness of their approaches.
Consistency examines whether the same explanations are reproduced under different models. Explanations are considered inconsistent if different models generate dissimilar explanations even though both the input and the corresponding output are the same. Model-agnostic methods (e.g., [1, 26, 37, 38]) tend to satisfy this criterion.
Fidelity specifies that the components in an explanation should actually explain the prediction: if the chosen components' values are manipulated, then the prediction should change accordingly. The importance of this concept is emphasized in [15], which proposes several aspects that make fidelity-based experimental tests more reliable. Several studies (e.g., [15, 16, 30, 34, 37, 39]) use fidelity to validate their methods.
Lastly, stability monitors how insensitive an explanation is to the modification of non-explanatory components. If the prediction is perturbed by manipulating non-explanatory components, then the explanation does not fully elucidate the prediction. A range of works (e.g., [2, 13, 27, 41]) have used stability as a metric for assessing explainability.
Our approach is model agnostic; thus, it does not affect the prediction accuracy. In addition, it assures consistency since the explanations are not impacted by the model structure, i.e., our approach is not tied to any specific prediction model. We therefore focus on the fidelity and stability of our approach, which leads to our next two research questions:
RQ2. Does our method provide similar explanations for similar tasks that have
the same prediction outcome?
RQ3. Do the explanations demonstrate the model’s actual reasons for predict-
ing if tasks are delayed or not?
4 Proposed Approach
among nodes are stored in an adjacency matrix A. If two nodes form an edge, then
the value of the corresponding element in the matrix is one; zero is allocated otherwise. A graph is then expressed as $G = \{X, E, A\}$.
To predict if a given task will be delayed, we construct a graph that captures the
dependencies between this task and other tasks in the project. The graph is then
input to a Graph Convolution Network (GCN) model adopted from [20]:
where I is an identity matrix, $\hat{A}$ is rewritten as $A + I$ ($\hat{A}$ captures the task edges between connected task nodes together with each node's self-loop), and W is a weight matrix holding the weights that conjoin two layers.
Our GCN model assumes one hidden layer; hence $W^{(0)}$ is the weight matrix connecting the input layer with the hidden layer, and $W^{(1)}$ is the weight matrix between the hidden layer and the output layer. $f_h$ is the activation function in the hidden layer, and $f_{out}$ is the activation function for the output layer. Conv
convolutes the graph to elicit patterns and clues for the prediction.
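For concreteness, the following is a minimal PyTorch sketch of such a one-hidden-layer GCN. The hidden size, the use of ReLU for $f_h$ and a sigmoid for $f_{out}$, and the omission of the degree normalization used in [20] are our illustrative assumptions rather than details stated in this chapter.

```python
import torch
import torch.nn as nn

class OneHiddenLayerGCN(nn.Module):
    """Illustrative GCN for node-level delay prediction (a sketch, not the authors' code)."""

    def __init__(self, num_features: int, hidden_dim: int):
        super().__init__()
        self.W0 = nn.Linear(num_features, hidden_dim, bias=False)  # W(0): input -> hidden
        self.W1 = nn.Linear(hidden_dim, 1, bias=False)              # W(1): hidden -> output

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        A_hat = A + torch.eye(A.size(0))                      # A_hat = A + I (adds self-loops)
        H = torch.relu(A_hat @ self.W0(X))                    # hidden layer convolution, f_h = ReLU
        return torch.sigmoid(A_hat @ self.W1(H)).squeeze(-1)  # per-node delay probability, f_out = sigmoid
```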
Fig. 2 The overall flow of our approach: (1) Delay prediction. (2) Component importance calculation. (3) Base subgraph extraction. (4) Irrelevant task nodes filtering. (5) Delay-explaining attributes identification
The calculation of the delay contribution level in each component (neighbor task nodes, edges, and task attributes) reveals candidate components heavily influencing
the delay probability. We adopt Integrated Gradients [32], a method that quantitatively calculates the importance level of each component by measuring the change (gradient) in the delay probability caused by adding the component. The importance level of a component is then computed by integrating these gradients.
We adopt Integrated Gradients for two reasons. Firstly, despite its simplicity, it achieves precise importance calculation. According to the study in [31], the performance of Integrated Gradients was the highest next to Class Activation Mapping (CAM) [42]; since CAM is available only when the last layer of a network is a global pooling layer, Integrated Gradients is more versatile. Another study [2] also demonstrated that model-agnostic methods such as Integrated Gradients tend to achieve higher and more stable performance. Secondly, model structure does not limit the applicability of Integrated Gradients as long as the model is differentiable. GCNs are differentiable, and therefore Integrated Gradients can be used to explain a GCN's predictions.
We compute the importance of a task node using Integrated Gradients as follows:
$$IG(x_i) \approx \sum_{j=1}^{a} \left(x_{ij} - x'_{ij}\right) \cdot \sum_{k=1}^{m} \frac{\partial F\left(x' + \frac{k}{m}\left(x_{ij} - x'_{ij}\right)\right)}{\partial x_{ij}} \cdot \frac{1}{m} \qquad (2)$$
where $x_i$ is a task node in the graph, $x_{ij}$ is an attribute in $x_i$, $x'_{ij}$ is the baseline feature value for $x_{ij}$, $a$ is the total number of attributes in each task node, and $m$ indicates the total number of divisions for calculating the integral via quadrature.
The summation of attribute importance leads to the importance of the task node
itself.
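As an illustration, the sketch below approximates Eq. (2) numerically. The callable F (returning the model's delay probability for a feature vector) and the finite-difference gradient are stand-ins for the autograd-based gradients one would use with an actual GCN.

```python
import numpy as np

def integrated_gradients_node(F, x, baseline, m=50, eps=1e-6):
    """Approximate Eq. (2): the importance of task node x_i.

    F        -- callable mapping a feature vector to the delay probability (assumed interface)
    x        -- observed attribute vector of the task node (length a), as a float array
    baseline -- baseline attribute vector x'
    m        -- number of quadrature steps for the integral
    """
    diff = x - baseline
    grad_sum = np.zeros_like(x, dtype=float)
    for k in range(1, m + 1):
        point = baseline + (k / m) * diff           # interpolate between baseline and input
        grad_sum += np.array([                      # numerical gradient of F at this point
            (F(point + eps * e) - F(point - eps * e)) / (2 * eps)
            for e in np.eye(len(x))
        ])
    attribute_ig = diff * grad_sum / m              # per-attribute IG values
    return attribute_ig.sum()                       # node importance = sum over its attributes
```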
Similarly, the importance of task edges can be measured as follows:
$$IG(e_{io}) \approx \left(e_{io} - e'_{io}\right) \cdot \sum_{k=1}^{m} \frac{\partial F\left(e' + \frac{k}{m}\left(e_{io} - e'_{io}\right)\right)}{\partial e_{io}} \cdot \frac{1}{m} \qquad (3)$$
where $e_{io}$ denotes the task edge between $x_i$ and $x_o$, and $e'_{io}$ is its baseline value. This calculates the gradient values for the graph with and without the target task edge.
Furthermore, attribute importance is calculated by the following equation:
$$IG(T_t) \approx \sum_{i=1}^{n} \left(x_{it} - x'_{it}\right) \cdot \sum_{k=1}^{m} \frac{\partial F\left(x' + \frac{k}{m}\left(x_{it} - x'_{it}\right)\right)}{\partial x_{it}} \cdot \frac{1}{m} \qquad (4)$$
where $n$ is the total number of nodes in the graph. The importance of the attribute $T_t$ is acquired by summing the attribute $T_t$'s importance over all nodes.
These three Integrated Gradients computations specify the importance level of each component, which is then used to extract the explanatory components that actually contribute to a prediction outcome.
$\delta(c_i, c_o)$ is a binary indicator of whether the selected task nodes belong to the same community. $\gamma$ adjusts the community granularity; its value is initially set to one, and as it decreases, more communities containing fewer task nodes are created [24]. The goal is to include task nodes that can significantly change the modularity value Q when added to a community.
The node clustering repeats twice for different purposes. The first clustering iden-
tifies the community holding the target node. The second clustering partitions the
selected community into smaller communities. The communities whose importance levels are equal to or higher than that of the target task node's community are included as part of the final subgraph. The importance of each community can be estimated as follows:
$$imp(com_s) = \sum_{i=1}^{n'} imp(x'_i) \qquad (6)$$
where $x'_i$ denotes a task node in the community $s$ and $n'$ is the total number of task nodes included in $s$. Through this clustering, a rough selection of task nodes and edges is achieved.
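A small sketch of this community-level selection, assuming the node importances from Eq. (2) and the communities produced by the two clustering rounds are already available:

```python
def community_importance(community, node_importance):
    """Eq. (6): the importance of a community is the sum of its task nodes' importances."""
    return sum(node_importance[v] for v in community)

def select_communities(communities, node_importance, target_node):
    """Keep the communities whose importance is equal to or higher than that of the
    community containing the target task node."""
    target_community = next(c for c in communities if target_node in c)
    threshold = community_importance(target_community, node_importance)
    return [c for c in communities
            if community_importance(c, node_importance) >= threshold]
```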
While the node clustering can filter the majority of irrelevant task nodes, the
subgraph may still contain fewer explanatory task nodes due to the community-wise
where $x_i^*$ is a task node in the subgraph and $\tau$ is the threshold based on the percentile of all task nodes' importance values in the extracted subgraph. For example, Q3, $Q3 + (Q3 - Q1)$, or $Q3 + (Q3 - Q1) * 1.5$ can be used as the threshold. After the
task node selection, all the task edges between non-target task nodes are removed.
Task edges that directly connect the target task node with others are the ones that
actually influence the prediction most strongly. Within the remaining task edges, quantile filtering is conducted in a similar way to Eq. (7): task edges whose importance is higher than the edge threshold remain in the subgraph.
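Although Eq. (7) itself is not reproduced above, the quantile-based filtering it refers to can be sketched as follows; treating the thresholds as Q3 plus a multiple of the interquartile range is our reading of the text.

```python
import numpy as np

def quantile_threshold(importances, level=1.0):
    """Q3 (level=0), Q3 + (Q3 - Q1) (level=1), or Q3 + 1.5 * (Q3 - Q1) (level=1.5)."""
    q1, q3 = np.percentile(importances, [25, 75])
    return q3 + level * (q3 - q1)

def filter_by_importance(items, importances, level=1.0):
    """Keep only the task nodes or edges whose importance exceeds the threshold."""
    tau = quantile_threshold(importances, level)
    return [item for item, imp in zip(items, importances) if imp > tau]
```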
After identifying the final subgraph, each feature attribute’s importance is calculated
from Eq. (4). They are ranked in the order of the importance level as below:
The attributes with the highest rank are the most influential attributes for the prediction, and as the rank lowers, the level of influence also drops. This information feeds into the delay-explaining attribute analysis, which checks whether the observed attribute values in the target node exceed the corresponding attribute thresholds. If an attribute value is higher than the attribute threshold in absolute value, then it is treated as an outlier attribute value. Such a task attribute is selected as one factor that may increase the delay probability. If an attribute is ranked as the most influential attribute and its value is higher than the threshold, then the attribute is regarded as a factor explaining the delay.
Attribute thresholds are computed from the quantile values of the feature attributes over all non-delayed task nodes; the quantile for the attribute $T_t$ is calculated from $T_t$'s values across all non-delayed nodes. Equation (7) is used for checking each target task node's attributes. If zero is measured, then the attribute value is determined to be a delay factor.
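A sketch of this delay-explaining attribute check is shown below. The quantile level used for the attribute thresholds is an assumption, since the chapter does not state which quantile is taken from the non-delayed nodes.

```python
import numpy as np

def delay_explaining_attributes(target_attrs, non_delay_attrs, ranked_attrs, q=75):
    """Return the attributes of the target task node flagged as delay factors.

    target_attrs    -- dict: attribute name -> value observed on the target node
    non_delay_attrs -- dict: attribute name -> values of that attribute over all non-delayed nodes
    ranked_attrs    -- attribute names ordered by the importance computed with Eq. (4)
    q               -- quantile (in percent) used to build the attribute thresholds (assumed)
    """
    factors = []
    for name in ranked_attrs:
        threshold = np.percentile(non_delay_attrs[name], q)
        if abs(target_attrs[name]) > abs(threshold):   # outlier check in absolute value
            factors.append(name)
    return factors
```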
5 Evaluation
Five project datasets (Apache, Duraspace, JBoss, Moodle, and Spring) and the corresponding graphs are used separately, and therefore five GCN models are constructed in total.
The evaluation consists of four different experiments. The first experiment
compares the performance of our method with the state of the art in issue delay
prediction [6]. The second experiment is designed to investigate the stability of
the generated explanations and predictions. The last two experiments validate the
fidelity of our generated explanations.
Stability Check
Fig. 3 Comparison of performance (the performance of Stacked Learning is adopted from [6])
RQ2. Does our method provide similar explanations for similar tasks that have
the same prediction outcome?
Method This experiment consists of two rounds. After computing delay probabil-
ity and generating an explanation graph in the first round, all the task nodes that are
directly linked to the target task node are chosen. Among the selected task nodes,
all the attributes of the task nodes that are excluded from the explanation graph are
manipulated with small values. The manipulation values are calculated with the formula $(\min(T_t) + 1) * 0.001$, where $T_t$ is an attribute value and min is the minimum function, which returns the minimum value of the attribute over all the task nodes in the dataset. For example, if the attribute's minimum value is 0, then
0.001 is returned as the manipulation value. This manipulated dataset is used for
the second round. If both the prediction and explanation are similar to the first
round’s, then our model is stable against trivial changes, and unstable otherwise.
The summary of the experiment is illustrated in Fig. 4. This experiment was run
30 times by selecting different delayed test task nodes in every dataset with three
different thresholds: Q3, $Q3 + (Q3 - Q1)$, and $Q3 + (Q3 - Q1) * 1.5$.
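The manipulation value described above is a simple function of the attribute's minimum over the dataset; a one-line helper makes the rule concrete:

```python
def manipulation_value(attribute_values):
    """(min of the attribute over all task nodes + 1) * 0.001; e.g., a minimum of 0 gives 0.001."""
    return (min(attribute_values) + 1) * 0.001
```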
In order to measure the stability, four metrics are applied. Prediction stability
measures the matching rate of predictions between the first and second rounds.
The higher the value becomes, the more stable the model prediction is against
irrelevant values. Therefore, we expect to observe a higher prediction stability value.
We also performed the Wilcoxon signed-rank test to statistically validate the delay
probability changes. The raw predictions are expressed from zero to one, and as
the value becomes closer to one, the likelihood of delay increases. Prior to the
experiment, we propose two hypotheses.
Fig. 4 The red dot is the target delay node, and the dotted line is the explanation graph for the prediction. (a) Non-explanatory components are manipulated with small values. (b') If the explanation is stable, then the raw delay probability and the explanation should be similar to the first round's. (b'') Otherwise, the prediction and explanation would not be stable
– $H_0$: there are no differences between the raw predictions from the first and second rounds.
– $H_1$: there are differences between the raw predictions from the first and second rounds.
If we fail to reject the null hypothesis $H_0$, then in practical terms, we can assume that the predictions in the two rounds are the same. We use Cohen's effect size to statistically measure the value difference between the first and second rounds.
The value in this experiment is expected to be small. Explanation stability is the
metric that indicates the similarity level of explanations between the two rounds.
This metric is calculated as below:
$$E_{stability} = \frac{sum\left(nodes'_{first} + nodes'_{second}\right)}{sum\left(nodes_{first} + nodes_{second}\right)} \qquad (9)$$
where $nodes_{first}$ and $nodes_{second}$ represent the task nodes in the first and second explanation graphs, respectively, and the primed terms count only the task nodes selected in both rounds. The denominator is therefore the total number of task nodes in the first and second explanation graphs, and the numerator is the total number of task nodes selected in both rounds. This metric ranges from 0 to 1; the closer the value is to 1, the higher the similarity.
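Read this way, Eq. (9) amounts to a Dice-style overlap between the two sets of explanation nodes, which can be computed as follows:

```python
def explanation_stability(nodes_first, nodes_second):
    """Eq. (9): shared explanation nodes (counted in both graphs) over all explanation nodes."""
    first, second = set(nodes_first), set(nodes_second)
    shared = first & second
    return 2 * len(shared) / (len(first) + len(second))
```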
The result of this experiment is shown in Table 2. $\tau1$, $\tau2$, and $\tau3$ denote thresholds 1, 2, and 3, respectively. $\tau0$ indicates that no threshold is set, which is used for the total edges. *, **, and *** show that the p-value is smaller than the significance level at 0.05, 0.01, and 0.001, respectively. P stability, E stability, and T edges denote prediction stability, explanation stability, and total edges, respectively.
Result Prediction stability higher than 90% is observed in most of the cases in
all the projects regardless of the thresholds. The high prediction stability values
show that there is no difference in the predictions before and after the manipulation.
The predictions are insusceptible to the minor and unimportant changes. This high
Table 2 Results of the stability check: P stability, Wilcoxon p-value, effect size, and E stability at thresholds τ1, τ2, and τ3, and the average total task edges (T edges)
Project     P stability (τ1/τ2/τ3)   p-value (τ1/τ2/τ3)   Effect size (τ1/τ2/τ3)   E stability (τ1/τ2/τ3)   T edges
Apache      0.93 / 0.97 / 1.00       0.98 / 0.78 / 0.50   −0.08 / −0.10 / −0.42    0.75 / 0.79 / 0.79       47
Duraspace   0.90 / 0.87 / 0.93       0.69 / 0.84 / 0.93   −0.02 / −0.10 / −0.01    0.82 / 0.85 / 0.94       89
JBoss       0.93 / 0.90 / 0.80       0.77 / 0.87 / 0.86   0.12 / 0.09 / −0.12      0.70 / 0.61 / 0.59       380
Moodle      0.80 / 0.93 / 0.90       0.58 / 0.63 / 0.49   0.12 / −0.10 / −0.07     0.73 / 0.78 / 0.79       144
Spring      0.97 / 1.00 / 0.97       0.35 / 0.22 / 0.77   −0.06 / 0.06 / 0.02      0.86 / 0.88 / 0.89       85
Average     0.91 / 0.93 / 0.92       0.67 / 0.67 / 0.70   0.01 / 0.00 / −0.12      0.77 / 0.78 / 0.80       149
prediction stability is also confirmed by the Wilcoxon test and effect sizes. All the
p-values computed from the raw probabilities from each of the two rounds are by
far higher than any of the significance levels. The effect size values in all the cases
are close to zero. Hence, we fail to reject the null hypothesis, which meets our
expectation (high stability).
The stability also needs to be confirmed from the explanation side as well via
the explanation stability metric. Explanation stability is achieved at least 77% on
average in all the different thresholds. Similar explanations are returned to the
same task nodes. One noticeable tendency is that the stability value increases as
the threshold becomes higher except for JBoss. The average stability value shifts
from 80% to 85% when the threshold changes from $\tau1$ to $\tau3$, excluding JBoss.
Therefore, the threshold needs to be adjusted based on the expected stability level.
The threshold adjustment can be conducted by considering it in the aspect of fidelity
as well (which is extended in the discussion section). JBoss shows a different
tendency with rather low performance because of the large average total task edges
at 380. This might raise the difficulty of precisely generating similar explanations to
the model.
Fidelity Check
Fidelity check consists of two experiments, which correspond to answering our last
question:
RQ3. Do the explanations demonstrate the model’s actual reasons for predict-
ing if tasks are delayed or not?
(a) Non-explanatory delay components removal test
Method This experiment is decomposed into two rounds (see Fig. 5). The first
round predicts the raw delay probability of a delayed test task node, followed by
identifying explanatory task nodes and edges. All irrelevant edges connected to the target node are eliminated by updating the adjacency matrix in the manipulation process. The second round predicts the raw delay probability of the same test task node again, but using the updated adjacency matrix. After both rounds, we compare
Fig. 5 (a) After the first round, the task edges of non-explanatory components connected to the target task node are removed. (b') If the explanation satisfies fidelity, then the removal does not change the prediction. (b'') If the prediction significantly changes, then our method does not satisfy fidelity
Table 3 Results of the non-explanatory delay components removal test: fidelity, Wilcoxon p-value, and effect size at thresholds τ1, τ2, and τ3
Project     Fidelity (τ1/τ2/τ3)   p-value (τ1/τ2/τ3)   Effect size (τ1/τ2/τ3)
Apache      0.91 / 0.87 / 0.91    0.24 / 0.34 / 0.11   −0.07 / −0.09 / 0.04
Duraspace   0.93 / 0.87 / 0.93    0.93 / 0.44 / 0.32   0.02 / 0.01 / 0.17
JBoss       0.90 / 0.90 / 0.93    0.96 / 0.54 / 0.88   −0.07 / −0.02 / −0.15
Moodle      0.97 / 0.93 / 0.93    0.27 / 0.28 / 0.36   −0.09 / −0.09 / 0.02
Spring      0.93 / 0.80 / 0.93    0.94 / 0.12 / 0.81   −0.12 / −0.11 / 0.03
Average     0.93 / 0.87 / 0.93    0.67 / 0.34 / 0.50   −0.06 / −0.06 / 0.02
the first and second prediction results. If the manipulation does not change the
result, then the explanation entails high fidelity. The prediction consistency shows
that the components of the explanation graph are the actual factors influencing
the result. Similar to the previous experiments, this experiment was also run 30
times by randomly selecting delayed tasks in test data. To check the effect of the
threshold for node and edge filtering, we also used three different thresholds: Q3,
$Q3 + (Q3 − Q1)$, and $Q3 + (Q3 − Q1) * 1.5$.
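A sketch of this manipulation step, assuming an undirected task graph stored as a NumPy adjacency matrix (the data structures are illustrative):

```python
import numpy as np

def remove_non_explanatory_edges(A, target, explanatory_neighbors):
    """Return a copy of the adjacency matrix in which every edge between the target
    task node and a neighbor outside the explanation graph is removed."""
    A2 = np.array(A, copy=True)
    for j in range(A2.shape[0]):
        if j != target and j not in explanatory_neighbors:
            A2[target, j] = 0
            A2[j, target] = 0
    return A2
```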
Result More than 90% fidelity is achieved in the majority of the cases. The removal of non-explanatory components did not affect the predictions in any of the datasets.
This high fidelity is statistically confirmed by the Wilcoxon test result and effect size
as well. All the p-values are larger than all the significance levels even when the lowest threshold ($\tau1$) is selected. Hence, we failed to reject $H_0$ in all the cases. The effect
size from all the cases is close to zero as well. The results from this experiment
allow us to conclude that our method is capable of correctly including explanatory
components inside the explanation graph. However, it does not provide sufficient
evidence to demonstrate the high fidelity since the actual influence level of the
explanatory components is not tested. Therefore, we conducted another experiment
that checked the effect of these components.
(b) Explanatory delay components manipulation test
Method This is also a two-round experiment (see Fig. 6). The first round predicts
the raw delay probability of a delayed test task node and extracts explanatory task
nodes and edges. The attributes in the selected task nodes are manipulated if the
values are above the corresponding attribute threshold. Their values are adjusted
to become just below the threshold value. In addition, if these selected task nodes’
labels are delayed, then their labels are changed into non-delay. After removing all
the delay factors, the model predicts the delay probability for the same target task
nodes by using the manipulated data. We compare the raw delay probability from the
first round to the second round. If the explanatory task nodes in the subgraph actually
affect the prediction, then the raw probability should become closer to zero in the
second round. This is because all the factors explaining the delay are eliminated.
Otherwise, the explanation would not satisfy the fidelity.
Three metrics are used in this experiment. The change rate is the rate of the
number of cases whose raw delay prediction dropped between the two rounds. A higher value means that probability drops were observed in more cases. The p-value from the Wilcoxon
signed-rank test is another metric that statistically demonstrates the raw probability
difference between the two rounds. The null and alternative hypotheses are identical to the previous experiment. The key difference is that we now expect the p-values
Fig. 6 (a) All delay factors are removed from the subgraph. (b’) If the explanation satisfies fidelity,
then the raw probability becomes lower than the first round. (b”) If not, then the probability does
not change
Table 4 Results of the explanatory delay components manipulation test: change rate, Wilcoxon p-value, and effect size at thresholds τ1, τ2, and τ3
Project     Change rate (τ1/τ2/τ3)   p-value (τ1/τ2/τ3)       Effect size (τ1/τ2/τ3)
Apache      0.63 / 0.57 / 0.63       0.00*** / 0.39 / 0.36    0.37 / 0.12 / 0.11
Duraspace   0.75 / 0.52 / 0.50       0.00*** / 0.10* / 0.26   0.18 / 0.11 / −0.04
JBoss       0.83 / 0.47 / 0.73       0.00*** / 0.91 / 0.01**  0.77 / −0.13 / 0.15
Moodle      0.80 / 0.53 / 0.53       0.10 / 0.25 / 0.53       0.64 / 0.50 / 0.25
Spring      0.83 / 0.50 / 0.57       0.00*** / 0.18 / 0.18    0.53 / 0.30 / 0.11
Average     0.75 / 0.52 / 0.50       0.02* / 0.35 / 0.27      0.50 / 0.18 / 0.12
to become smaller than a significance level, which leads to the rejection of the
null hypothesis. This proves that manipulating selected components alters the raw
probability, and thus they are actually important for the prediction. The expectation
of the effect size is also opposite to that of the previous experiments: since the values from the first and second rounds should be significantly different, the effect size needs to be large.
Table 4 shows the results of this experiment. $\tau1$, $\tau2$, and $\tau3$ are the same thresholds 1, 2, and 3 as in the previous experiments. The marks *, **, and *** also denote the same significance levels.
Results The change rate shows that the delay probability drops in most cases at the lowest threshold, with 75% on average. However, if the threshold is too high, substantial delay probability reduction is no longer observed, with the rate dropping to 50%. This is because a number of explanatory components are also filtered out. They can still influence the delay even though they are not included in the subgraph; the target nodes keep receiving delay influences from these components, and the raw delay probability consequently remains at a similar level.
The p-values and effect sizes capture this tendency as well. At $\tau1$, almost all the projects reject the null hypothesis with 0.05 as the maximum significance level. However, as the threshold increases, all the projects except Duraspace and JBoss fail to reject the null hypothesis at any significance level. The effect size at $\tau1$ is 0.5, but as the threshold becomes higher, the value moves closer to zero. Therefore, the selection of the optimal threshold is also important from the viewpoint of fidelity so that the null hypothesis can be rejected. These two experiments show that, by selecting the optimal threshold, our method satisfies fidelity: it includes truly explanatory components in the generated explanation, and they are influential to the predictions.
6 Conclusions
In this chapter, we proposed a new method that is capable of more accurate delay prediction and of generating explanations for the predictions. The experimental results demonstrate that our approach achieved 68% precision, 84% recall, and 74% F-measure on average. In addition, the explanations generated by our model achieved high stability and fidelity, indicating the effectiveness of our approach.
7 Data Availability
All the code and data used in this chapter are available at https://zenodo.org/record/
6999325#.YvxsjHZBxyw. The datasets were obtained from [6] and published at
https://github.com/morakotch/datasets/tree/master/delayed%20issues/ASE2015.
References
1. Alvarez-Melis, D., Jaakkola, T.S.: A causal framework for explaining the predictions of black-
box sequence-to-sequence models. arXiv:1707.01943 (2017)
2. Alvarez-Melis, D., Jaakkola, T.S.: Towards robust interpretability with self-explaining neural
networks. arXiv:1806.07538 (2018)
3. Ancona, M., Ceolini, E., Öztireli, C., Gross, M.: Towards better understanding of gradient-
based attribution methods for deep neural networks. arXiv:1711.06104 (2018)
4. Brynjolfsson, E., Mitchell, T.: What can machine learning do? Workforce implications. Science
358, 1530–1534 (2017)
5. Choetkiertikul, M., Dam, H.K., Ghose, A.: Threshold-based prediction of schedule overrun
in software projects. In: Proceedings of the ASWEC 2015 24th Australasian Software
Engineering Conference, pp. 81–85 (2015)
6. Choetkiertikul, M., Dam, H.K., Tran, T., Ghose, A.: Predicting delays in software projects
using networked classification. In: 2015 30th IEEE/ACM International Conference on Auto-
mated Software Engineering (ASE) (2015)
7. Choetkiertikul, M., Dam, H.K., Trany, T., Ghose, A.: Characterization and prediction of issue-
related risks in software projects. In: 2015 IEEE/ACM 12th Working Conference on Mining
Software Repositories, pp. 280–291 (2015)
8. Choetkiertikul, M., Dam, H.K., Tran, T., Ghose, A.: Predicting the delay of issues with due
dates in software projects. Empirical Software Eng. 22(3), 1223–1263 (2017)
9. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very large networks.
arXiv:cond-mat/0408187 (2004)
10. da Costa, D.A., McIntosh, S., Kulesza, U., Hassan, A.E., Abebe, S.L.: An empirical study of
the integration time of fixed issues. Empirical Software Eng. 23(1), 1–50 (2018)
11. Frese, R.: Project success and failure: What is success, what is failure, and how can
you improve your odds for success? (2003). Retrieved from http://www.umsl.edu/~sauterv/
analysis/6840_f03_papers/frese/
12. Group, T.S.: The CHAOS Report. The Standish Group International, Inc., Boston (2015)
13. Guidotti, R., Ruggieri, S.: On the stability of interpretable models. In: 2019 International Joint
Conference on Neural Networks (IJCNN) (2019)
14. Hooker, S., Erhan, D., Kindermans, P.J., Kim, B.: A benchmark for interpretability methods in
deep neural networks. arXiv:1806.10758 (2019)
15. Jacovi, A., Goldberg, Y.: Towards faithfully interpretable NLP systems: how should we define
and evaluate faithfulness? arXiv:2004.03685 (2020)
16. Ji, C., Wang, R., Wu, H.: Perturb more, trap more: understanding behaviors of graph neural
networks. Neurocomputing 493(7), 59–75 (2022)
17. Kapishnikov, A., Bolukbasi, T., Viégas, F., Terry, M.: XRAI: better attributions through regions.
arXiv:1906.02825 (2019)
18. Khan, K., Zafar, A., Alnuem, M., Khan, H.: Investigation of time delay factors in global
software development. World Acad. Sci. Eng. Technol. Open Science Index 63, International
Journal of Computer and Information Engineering 6(3), 318–326 (2012)
19. Kindermans, P.J., Schütt, K., Müller, K.R., Dähne, S.: Investigating the influence of noise and
distractors on the interpretation of neural networks. arXiv:1611.07270 (2016)
20. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks.
arXiv:1609.02907 (2016)
21. Lin, C., Sun, G.J., Bulusu, K.C., Dry, J.R., Hernandez, M.: Graph neural networks including
sparse interpretability. arXiv:2007.00119 (2020)
22. Michael, B., Blumberg, S., Laartz, J.: Delivering large-scale IT projects on time, on budget,
and on value. Technical report (2012)
23. Munns, A.K., Bjeirmi, B.F.: The role of project management in achieving project success. Int.
J. Project Manage. 14, 81–87 (1996)
24. Newman, M.E.J.: Community detection in networks: modularity optimization and maximum
likelihood are equivalent. arXiv:1606.02319 (2016)
25. Nguyen, T.H.D., Adams, B., Hassan, A.E.: Studying the impact of dependency network measures on software quality. In: 2010 IEEE International Conference on Software Maintenance, pp. 1–10 (2010). https://doi.org/10.1109/ICSM.2010.5609560
26. Plumb, G., Molitor, D., Talwalkar, A.: Model agnostic supervised local explanations.
arXiv:1807.02910 (2019)
27. Plumb, G., Al-Shedivat, M., Cabrera, A.A., Perer, A., Xing, E., Talwalkar, A.: Regularizing
black-box models for improved interpretability. arXiv:1902.06787 (2020)
28. Pope, P.E., Kolouri, S., Rostami, M., Martin, C.E., Hoffmann, H.: Explainability methods
for graph convolutional neural networks. In: Proceedings/CVPR, IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (2019)
29. Rathee, M., Zhang, Z., Funke, T., Khosla, M., Anand, A.: Learnt sparsification for interpretable
graph neural networks. arXiv:2106.12920 (2021)
30. Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": Explaining the predictions of any classifier. arXiv:1602.04938 (2016)
31. Sanchez-Lengeling, B., Wei, J., Lee, B., Reif, E., Wang, P., Qian, W., McCloskey, K., Colwell,
L., Wiltschko, A.: Evaluating attribution for graph neural networks. Advances in Neural
Information Processing Systems 33 (NeurIPS 2020) (2020)
32. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: Proceedings
of the 34th International Conference on Machine Learning (2017)
33. van Genuchten, M.: Why is software late? An empirical study of reasons for delay in software
development. IEEE Trans. Software Eng. 17(6), 582–590 (1991)
34. Velmurugan, M., Ouyang, C., Moreira, C., Sindhgatta, R.: Developing a fidelity evaluation
approach for interpretable machine learning. arXiv:2106.08492 (2021)
35. Vu, M.N., Thai, M.T.: PGM-explainer: probabilistic graphical model explanations for graph
neural networks. arXiv:2010.05788 (2020)
36. Walfish, S.: A review of statistical outlier methods. Pharm. Technol. 30(11), 82 (2006)
37. Ying, R., Bourgeois, D., You, J., Zitnik, M., Leskovec, J.: GNN explainer: a tool for post-hoc
explanation of graph neural networks. Preprint. arXiv:1903.03894 (2019)
38. Ying, R., Bourgeois, D., You, J., Zitnik, M., Leskovec, J.: GNNExplainer: generating expla-
nations for graph neural networks. In: Advances in Neural Information Processing Systems
(2019)
39. Yuan, H., Yu, H., Gui, S., Ji, S.: Explainability in graph neural networks: a taxonomic survey.
Preprint. arXiv:2012.15445 (2021)
40. Zhang, J., Lin, Z., Brandt, J., Shen, X., Sclaroff, S.: Top-down neural attention by excitation
backprop. In: European Conference on Computer Vision (2016)
41. Zheng, S., Song, Y., Leung, T., Goodfellow, I.: Improving the robustness of deep neural
networks via stability training. arXiv:1604.04326 (2016)
42. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for
discriminative localization. arXiv:1512.04150 (2015)
Classifying User Intent for Effective
Prompt Engineering: A Case
of a Chatbot for Startup Teams
1 Introduction
gain the ability to tailor prompts that align with the user’s expectations, eliciting
responses that transcend mere linguistic accuracy to encompass contextual relevance
and coherence.
Despite the central role user intent plays in shaping effective prompts, the
existing literature on prompt engineering sheds little light on this essential com-
ponent. The complexity and nuances involved in deciphering and categorizing user
intent can be attributed to ambiguous queries, diverse language usage, and varied
communication styles [4]. The scarcity of studies on user intent within the context
of prompt engineering underscores a knowledge gap that needs to be tackled to
harness the full potential of LLMs.
Our study aspires to address this gap and contribute to effective prompt engineer-
ing for LLMs. The research question that guides our research is:
How to identify the intent of a user’s query as part of prompt engineering for
LLMs using ChatGPT?
Utilizing a practical example centered around the development of a chatbot
tailored for startup teams seeking improved responses from ChatGPT, we illuminate
a feasible approach to classifying user intent automatically using ChatGPT itself.
The chatbot we developed, called DAS (Digital Assistant for Startups), intends
to automate the prompt engineering process for startup teams so that they can
benefit from improved interactions with LLMs. User intent classification is one key
component of DAS. Through this specific application scenario, we demonstrate how
user intent can be classified as part of prompt construction, thus contributing to the
evolving field of prompt engineering for LLMs. Our study also presents practical
implications for the effective utilization of prompt engineering in the context of
startups.
These principles are also reflected in prompt patterns [10]. Prompt patterns are
reusable solutions to the raised problems while interacting with LLMs. The concept
and inspiration behind prompt patterns are derived from software patterns. Just
as software patterns offer a structured method to address challenges in software
development, prompt patterns offer a systematic approach to customize the output
and interaction with LLMs. The main motivation behind prompt patterns is that such
patterns can be used to enhance the discipline of prompt engineering. These patterns
can be grouped into five categories: input semantics, output customization, prompt
improvement, error identification, interaction, and context control. What remains
unclear is how to decide which specific pattern is the most appropriate one to be
applied to a user’s query.
A comprehensive guide of prompt engineering for ChatGPT [5] indicates a way
to address this concern. The guide covers various aspects of prompt engineering
for ChatGPT, including its fundamentals, effective techniques, best practices, and
advanced strategies. Leveraging System 1 and System 2 questions is introduced
as one technique for effective prompt crafting. According to the author, under-
standing the difference between System 1 and System 2 questions can improve
the quality of ChatGPT’s responses. System 1 questions typically require quick,
intuitive, or pattern-recognition-based answers, while System 2 questions involve
more deliberate, analytical, or complex problem-solving. Tailoring prompts to
the type of question being asked will allow one to optimize the interaction and
maximize the usefulness of the generated output. The guide does not suggest any
specific classification of user questions, but it does emphasize the importance of
understanding user intent, that is, the user’s purpose of interacting with ChatGPT. It
helps craft a prompt that aligns with the user’s expectations.
However, as far as the authors are aware, there is a scarcity of studies on user
intent within the context of prompt engineering for LLMs. Our study intends to fill
this knowledge gap.
The study presented in this chapter is part of a research project to develop a chatbot
for startups. Our eventual goal is to help startup teams benefit from LLMs by
reducing the efforts required from them to learn prompt engineering. Figure 1 shows
the architecture and process of the chatbot.
As shown in Fig. 1, the chatbot is composed of two parts: a prompt book and
a prompt engine. The prompt engine is the intermediary between the user and an LLM, turning the user's original query into effective prompts using the
prompt book. In this way, the user can have natural and intuitive interactions with
ChatGPT without learning how to do prompt engineering. By controlling the flow
of information and processing user queries, the chatbot streamlines the interaction
process and improves the user experience with an LLM.
Fig. 1 The architecture and process of the chatbot, showing the numbered interaction flow (1–6) between the user, the chatbot (prompt engine and prompt book), and the LLM
In the center of the chatbot is what we called the “prompt book.” The inspiration
for designing the prompt book originates from the exploration of prompt patterns
proposed in the paper [10]. The prompt book is composed of three parts: (1) a set of
prompt patterns and the corresponding templates for generating prompts; (2) the list
of user intent types that can be used to classify a user’s query; and (3) the matching
between the types of user intent and the prompt patterns.
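As a rough illustration of how such a prompt book can be represented in code, the snippet below pairs intent types with pattern templates. The pattern names, template wording, and intent-to-pattern mapping are hypothetical examples for illustration, not the actual content of our prompt book.

```python
# Hypothetical prompt book: patterns and templates, intent types, and their matching.
PROMPT_PATTERNS = {
    "persona": "Act as {role}. {query}",
    "cognitive_verifier": "Break the following question into sub-questions, answer them, "
                          "and then combine the answers: {query}",
}

USER_INTENT_TYPES = [
    "seeking information",
    "seeking advice",
    "brainstorming",
    "reflecting on own experience",
]

INTENT_TO_PATTERN = {                      # illustrative matching only
    "seeking information": "cognitive_verifier",
    "seeking advice": "persona",
    "brainstorming": "persona",
    "reflecting on own experience": "persona",
}

def build_prompt(intent: str, query: str, role: str = "an experienced startup mentor") -> str:
    """Turn a classified user query into a prompt using the matched pattern template."""
    template = PROMPT_PATTERNS[INTENT_TO_PATTERN[intent]]
    return template.format(role=role, query=query)
```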
In our previous work [9], we evaluated the prompt patterns suggested in [10] to
understand their applicability and how to adapt them in the context of startups. The
selected and adapted prompt patterns can be found online.1 In this chapter, we focus
on the second element of the prompt book and the list of user intent types that can
be used to classify the queries from a startup team.
1 https://figshare.com/s/feef2d27953be1188093
Fig. 2 The process of defining the types of user intent (from the literature, in the startup context), validating the four types (using ChatGPT-generated questions (1) and a survey of startup course students), and evaluating the implementation of user intent classification (using manually generated questions, ChatGPT-generated questions (1) and (2), a subset of the ChatGPT-generated questions (2), and the questions collected from the survey)
Since the chatbot was meant to support startup teams, our first attempt used the core pillars of startups defined in [6]: product, team, finance, and market. However, we quickly realized that this classification scheme is irrelevant to our need to choose appropriate prompt patterns. Instead, the purpose, or the user intent, behind a question seems more relevant for choosing which prompt pattern to use. We therefore need a classification scheme to categorize the intent of a startup team when interacting with an LLM. We purposefully reviewed the literature to find ways to classify the queries of startup teams when interacting with mentors or domain experts. As a result of this step, we identified four types of intent that can be used to classify the queries from startup teams.
We used OpenAI’s API for ChatGPT to implement the classification of user intent
in an automatic manner so that it will become part of the chatbot functionality. This
2 https://figshare.com/s/feef2d27953be1188093
option also keeps the technology stack required to implement the chatbot to a minimum. Recent studies on prompt engineering and intent classification using ChatGPT suggest that synthetic data generated by ChatGPT can be used, and that few-shot prompting can achieve relatively good results when real data is scarce [2, 7, 8]. Therefore, we manually generated 4 questions per user intent type and then prompted ChatGPT to generate an additional 26 questions per user intent type. In total, we prepared 120 questions, which are used in the Python script for user intent classification. The 120 questions, the prompts used to generate 104 of them, and the Python script are shared online.3
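A minimal sketch of such a few-shot classification call with the OpenAI Python client (v1.x) is shown below. The example questions, the model name, and the prompt wording are illustrative placeholders rather than the published script.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

INTENT_TYPES = ["seeking information", "seeking advice",
                "brainstorming", "reflecting on own experience"]

FEW_SHOT_EXAMPLES = [  # in practice, 30 labelled questions per intent type were prepared
    ("What is a minimum viable product?", "seeking information"),
    ("Should we pivot before our next funding round?", "seeking advice"),
    ("Give me ideas for marketing our app on a small budget.", "brainstorming"),
    ("We lost a key customer last month; what could we have done differently?",
     "reflecting on own experience"),
]

def classify_intent(query: str) -> str:
    """Classify a startup team's question into one of the four intent types."""
    messages = [{"role": "system",
                 "content": "Classify the user's question into exactly one of: "
                            + ", ".join(INTENT_TYPES) + ". Reply with the label only."}]
    for question, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return response.choices[0].message.content.strip().lower()
```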
5 Results
Four types of user intent are included in the prompt book: seeking information,
seeking advice, brainstorming, and reflecting on own experience. Their definitions
are provided in Table 1.
3 https://figshare.com/s/feef2d27953be1188093
4 https://figshare.com/s/feef2d27953be1188093
The result of manually classifying the 100 ChatGPT generated questions shows
that all questions can be categorized using the types defined in Table 1. The majority
(57) are classified under seeking advice, 35 under seeking information, and 8 are
brainstorming questions. None of these questions is of reflective intention.
The result from the conducted survey shows somewhat different patterns. We
received 49 responses from the survey with the Finnish graduate students at the
beginning of their startup course in October 2023. These students came from
different faculties including IT, business and management, and finance. The class
has a good balance of gender and nationality, and most of them are in their 20s to
early 30s. Figure 3 shows the levels of their knowledge of startups at the beginning
of the course, from 1 (very little) to 9 (very knowledgeable).
As shown in Fig. 3, the majority of the class believe that they already have
certain levels of startup knowledge, with a few of them (8) considering themselves
knowledgeable or very knowledgeable on the topic. The median self-assessed
startup knowledge level of the respondents is 4.0. The mean score is approximately
4.57. Therefore, the collective self-assessed knowledge level is below the midpoint
on the scale used in the survey.
Figure 4 shows their attitudes toward using a chatbot to support their startup-
building processes, which are mostly positive, with 51% of the respondents
answering “Yes” and 46.9% saying “Maybe.” Interestingly, for the respondents
who answered “Yes,” their mean startup knowledge level is approximately 4.96,
higher than that of the “Maybe” group (approximately 4.26). This suggests that the
students who are more open to using a chatbot for startup support (“Yes”) tend to
rate their startup knowledge slightly higher than those who are uncertain (“Maybe”).
However, the difference in the levels of self-assessed startup knowledge between
the two groups (“Yes” vs. “Maybe”) is not statistically significant, according to the
independent t-test (t ≈ 1.35 with a p-value of about 0.183, higher than the common threshold of 0.05).
Regarding the evaluation of the types of user intent, 47 responses were received
to the first evaluation question, For what purposes would you like to use a chatbot?,
which is a multiple-choice one based on the four types of user intent as defined in
Table 1. As shown in Fig. 5, among the four types of user intent, brainstorming is the
most frequently chosen type, followed by seeking information and seeking advice.
Reflecting on own experience is chosen less often than the other types but is still considered relevant by some respondents.
6 Concluding Remarks
In this chapter, we focused on the classification of user intent, one important facet
of a nuanced process of crafting effective prompts for ChatGPT. The presented
work is part of the design process of a chatbot to support the startup teams. The
defined types of user intent are a key element of the prompt book that underpins
the implementation of the chatbot. We evaluated the defined user intent types
using both synthetic startup questions generated by ChatGPT and a survey of the
graduate students attending a startup course at a Finnish university. Additionally,
we demonstrated how to automate the classification of user intent using ChatGPT itself and evaluated the effectiveness and reliability of the classification results.
The primary contribution of this research lies in the approach we described
to identify user intent behind the interactions with LLMs, which is a critical
aspect often overlooked in prompt engineering for LLMs. The development of
the chatbot stands as an illustration of this approach, offering startup teams an
intuitive and efficient method to engage with LLMs without requiring extensive
knowledge in prompt engineering. Through the example of developing the prompt
book that underpins the chatbot, we demonstrated how user intent can be defined
and implemented.
Our findings indicate that the classification of user intent into four categories—seeking information, seeking advice, reflecting on own experience, and brainstorming—provides a good framework for selecting appropriate prompt patterns.
There are several limitations in our study. Firstly, it primarily focused on the startup
environment, which may not fully represent the diverse range of scenarios where
LLMs like ChatGPT are applied. However, we do believe that the methodology
and insights gained from this study can be adapted to various other domains
where LLMs are utilized. Future work could explore the application of this user
intent classification framework in different sectors, such as education, healthcare, or
customer service.
Another limitation of our study is that the evaluation of the user intent categories
used a small sample of students in one course in one university, which limits the
generalizability of the evaluation results to the general population of startup teams.
Therefore, a larger sample of startup teams would provide more valid evaluation
results.
We also recognized the inherent challenges in training LLMs to accurately
interpret and respond to nuanced human queries, a task that often requires con-
tinuous refinement and adaptation. There is a significant opportunity to refine the
classification model by incorporating more dynamic and context-aware algorithms,
potentially integrating advanced natural language processing techniques.
Lastly, the study focused on a specific LLM, which is ChatGPT, and this may
limit the generalizability of the findings to other LLMs. The research also relied
heavily on the accuracy and reliability of ChatGPT, which, although state of the art,
is not devoid of potential biases or errors.
We are in the process of implementing the complete chatbot based on the research
presented in this chapter and in our previous work [9]. We planned two steps of
evaluation that allow us to evaluate the chatbot, including the results presented in
this chapter, in a more systematic manner: (1) small-scale evaluation with several
startup teams and (2) large-scale quantitative evaluation through the implemented
chatbot.
In the small-scale evaluation, we will sit side by side with the selected startup
teams and administer their interactions with ChatGPT, first using standard Chat-
GPT without specific prompting, and then we guide the startup teams to use our
chatbot. We will collect all the data generated in the process, including the questions
asked, the user intent classifications, the generated prompts, and the responses from
both standard ChatGPT and from our chatbot. We will use the generated data to
validate the content in the prompt book, including the user intent classification.
We envision that the classification of user intent as well as the templates of the
selected prompt patterns will be modified, extended, and improved after the small-
scale evaluation.
For the large-scale evaluation, we will make the chatbot accessible freely for
startup teams to use. With their permission, we will collect relevant data for research
purposes. Both prompt book and the usability of the chatbot will be evaluated and
improved using the data from a larger sample of startup teams.
Acknowledgments We would like to express our sincere thanks for the support we received from
the students who are involved in our research both at our university and from the University of
Jyväskylä, Finland.
References
1. Atlas, S.: ChatGPT for higher education and professional development: a guide to conversa-
tional AI (2023). Independently published
2. Bouzaki, A.G.: Enhancing intent classification via zero-shot and few-shot ChatGPT prompting engineering: generating training data or directly detecting intents. Master thesis. National and Kapodistrian University of Athens (2023)
3. Clavié, B., Ciceu, A., Naylor, F., Soulié, G., Brightwell, T.: Large language models in
the workplace: a case study on prompt engineering for job type classification. Preprint.
arXiv:2303.07142 (2023)
4. Dhole, K.D.: Resolving intent ambiguities by retrieving discriminative clarifying questions.
Preprint. arXiv:2008.07559 (2020)
5. Ekin, S.: Prompt Engineering for ChatGPT: A Quick Guide to Techniques, Tips, and Best
Practices. Preprint (May 2023). https://doi.org/10.36227/techrxiv.22683919.v2, https://www.
techrxiv.org/doi/full/10.36227/techrxiv.22683919.v2
6. Giardino, C., Bajwa, S.S., Wang, X., Abrahamsson, P.: Key challenges in early-stage software
startups. In: Agile Processes in Software Engineering and Extreme Programming: 16th
International Conference, XP 2015, Helsinki, May 25–29, 2015, Proceedings 16, pp. 52–63.
Springer, Berlin (2015)
7. Tang, R., Han, X., Jiang, X., Hu, X.: Does synthetic data generation of LLMs help clinical text
mining? Preprint. arXiv:2303.04360 (2023)
8. Ubani, S., Polat, S.O., Nielsen, R.: ZeroShotDataAug: generating and augmenting training data
with ChatGPT. Preprint. arXiv:2304.14334 (2023)
9. Wang, X., Idris, M.A., Rafiq, U., Hubner, S.: Turning large language models into AI
assistants for startups using prompt patterns. In: International Conference on Agile Software
Development, pp. 22–26. Springer International Publishing, Cham (2023)
10. White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith,
J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with ChatGPT.
Preprint. arXiv:2302.11382 (2023)
11. Zhang, J., Bui, T., Yoon, S., Chen, X., Liu, Z., Xia, C., Tran, Q.H., Chang, W., Yu, P.: Few-
shot intent detection via contrastive pre-training and fine-tuning. In: Moens, M.F., Huang,
X., Specia, L., Yih, S.W.t. (eds.) Proceedings of the 2021 Conference on Empirical Methods
in Natural Language Processing, pp. 1906–1912. Association for Computational Linguistics,
Online and Punta Cana (2021). https://doi.org/10.18653/v1/2021.emnlp-main.144, https://
aclanthology.org/2021.emnlp-main.144
Toward Guiding Students: Exploring
Effective Approaches for Utilizing AI
Tools in Programming Courses
Mika Saari, Petri Rantanen, Mikko Nurminen, Terhi Kilamo, Kari Systä,
and Pekka Abrahamsson
Abstract This study explores the role of artificial intelligence (AI) in higher
education, with a focus on the teaching of programming. Despite the growing use
of AI in education, both students and teachers often struggle to understand its
role and implications. To address this gap, we conducted surveys on two different
university programming courses to assess the experiences and perspectives of
over 200 students on the use of AI in programming education. Combined, these
findings underscored the need for guidance on how students should use AI tools.
Furthermore, the findings suggest that AI is becoming increasingly integrated into
university education, especially in programming courses. Responding to this need,
we extend the contribution of the study by introducing a set of best practices for
AI tool usage in programming courses. Overall, the study highlights the need for
greater awareness and understanding of AI in university teaching and the fact that
teachers have an important role to play in providing guidance to students on the
responsible use of AI tools.
1 Introduction
AI-assisted teaching and learning, e.g., the use of conversational style language
models like ChatGPT,1 has the potential to revolutionize the way programming
courses are taught. With artificial intelligence (AI), students can receive tailored,
personalized, and interactive support, receive immediate feedback, and have access
1 https://chat.openai.com
to a virtual tutor 24/7. AI can also analyze the students’ performance and provide
suggestions for improvement, making the learning experience more efficient and
effective. Additionally, AI can automate tedious tasks such as grading code, freeing
up instructors to focus on more important tasks such as providing meaningful
feedback and engaging in interactive discussions. The role of AI-assisted learning
in higher education (HE) can be viewed both from the perspective of the learner and
the teachers.
This study is an extension of our earlier study [14], which focused on the
students’ point of view, based on a survey of BSc-level programming students. In
that paper, we investigated how much students already use AI to support their learning of programming and for what purposes. The survey incorporated both
multiple-choice items and open-ended questions to gather data. Responses to the
open-ended questions were systematically categorized using relevant keywords. The
results were presented as percentage distributions and in the form of bar charts [14].
In this study, we used additional data from another programming course. The
collected data consists of students’ written feedback of how they use AI tools when
they program the exercises related to the course.
Based on the collected data, this chapter presents best practices for teachers on
how to guide students in the use of AI. Thus, the following research question was
formulated:
RQ: What are the best practices for including AI in programming courses?
The rest of the chapter is structured as follows: Sect. 2 clarifies the research
environment, including recent AI-related studies. Section 3 focuses on the research
methodologies, context, and data collection and analysis methods and also discusses
the findings of the surveys. Section 4 contains the proposed best practices and a
discussion of them. Also, the threats to validity and possible future research items
are discussed. Finally, Sect. 5 summarizes the research in relation to the research
question.
In the context of this research, the keyword "programming course" was used to identify relevant studies through a search of the IEEE Xplore database, resulting in a total of 16 publications. Three of these studies, namely,
[2, 5, 17], emphasized the role of instructional guidance during the learning process.
Question and answer sites, such as Stack Overflow, are often used for finding
help on programming-related tasks. How students use Stack Overflow was studied
in detail by Robinson [13]. Traditionally, search engines have been used to seek
further assistance for programming tasks. However, a new trend is emerging where
people are turning to AI-based solutions. These solutions may take the form of
virtual assistants or chatbots, such as Siri, Alexa, or Google Assistant [8]. The
increasing popularity of AI-assisted learning, utilizing chatbots as a means of
2 https://ieeexplore.ieee.org/Xplore
3 For example, https://www.tuni.fi/en/students-guide/handbook/uni/studying-0/academic-integrity-students/use-ai-based-applications, https://www.ed.ac.uk/bayes/ai-guidance-for-staff-and-students, https://oue.fas.harvard.edu/ai-guidance
The process of our research is illustrated in Fig. 1. In our previous study [14], we performed a survey (Round 1 in Fig. 1) for students on a basic university programming course and analyzed its results. The key findings of the earlier study serve as one of the starting points for this study. We supplemented these findings by executing another survey in a mobile programming course, which generally targets students who are further along in their studies than those on the basic course. A total of 223 students answered the surveys. The details of the first survey are presented in [14], and the second is discussed in depth in the following subsections.
Fig. 1 Process used in seeking best practices: flow of the tasks undertaken and their results
The results of the second survey were analyzed and categorized in an attempt to find common issues, concerns, and comments reported by the students. The ChatGPT tool was used to help formulate the categories, with the authors of this work fine-tuning the results. The process of creating the categories is explained in more detail in Sect. 3.2.
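Purely as an illustration of what a simple keyword-based first pass over such free-text feedback could look like before manual fine-tuning, consider the sketch below. The categories and keywords are invented for this example and are not the ones used in the study.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of keyword-based categorization of free-text feedback.
// The categories and keywords are invented examples, not those used in the study.
public class FeedbackCategorizer {

    private static final Map<String, List<String>> KEYWORDS = Map.of(
            "Debugging support", List.of("debug", "error", "fix"),
            "Explanations and theory", List.of("explain", "understand", "theory"),
            "No AI use", List.of("did not use", "never used"));

    // Returns every category whose keywords appear in the given answer.
    public static List<String> categorize(String answer) {
        String normalized = answer.toLowerCase();
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : KEYWORDS.entrySet()) {
            for (String keyword : entry.getValue()) {
                if (normalized.contains(keyword)) {
                    matches.add(entry.getKey());
                    break; // one keyword hit is enough for this category
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Example run; the matched categories would then be reviewed and
        // fine-tuned manually, as described in the text.
        System.out.println(categorize(
                "I used ChatGPT to debug my code and to understand the theory."));
    }
}
```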
The key findings of the first survey, as well as the categorizations produced from
the second survey, were further analyzed to detect the challenges related to the use
of AI in programming courses, with the reasoning being that whatever problems the
students faced should be taken into consideration when formulating any guidelines
for the use of AI.
Finally, based on the identified challenges, best practices (guidelines) were
crafted, including suggestions given by the students participating in the surveys,
and complemented by the insight of the authors. The first and fourth authors of this chapter worked as teachers on the programming course of the first survey, and the first and second authors on the mobile programming course of the second survey, while the remaining authors have long experience in teaching in general. Thus, the student feedback, enriched by the teachers' views, should provide a comprehensive basis for designing the best practice guidelines.
In our study [14] evaluating students’ perspectives on AI, several key findings
emerged. These insights provide an overview of the students’ attitudes toward the
integration of AI tools in their learning process:
A Interest in AI: There is interest in AI among newer students, suggesting its
inclusion in curricula could be beneficial.
B Resistance to AI: While AI is largely seen as a helpful tool for learning and coding tasks, some students resist its use, either for unspecified reasons or because they believe it makes tasks too easy.
C Difficulties in using AI: Initial difficulties with AI tools could result in discon-
tinuation, hinting at a need for more introductory guidance or support.
Implications from the First Round The findings suggest that while a majority of
students are open to utilizing AI tools in their academic pursuits, there is a need for
more structured introduction and guidance, especially to overcome initial obstacles.
This section introduces the data collection and analyses from the Mobile Program-
ming course. A noteworthy aspect of this course was the use of AI tools. Students
had the opportunity to use tools such as OpenAI’s ChatGPT and GitHub Copilot7
for programming tasks. In the following paragraphs, we will scrutinize the data
collected in this course and highlight the results and insights obtained from the
integration of AI tools.
Course Overview The university introduced a course on Mobile Programming, corresponding to 5 ECTS (European Credit Transfer and Accumulation System) points. The main objective of the course was to develop practical programming skills, with a particular focus on mobile development. The course was optional in several study programs.
Course Details The course included a comprehensive set of 18 programming
exercises that were designed to offer students a deep dive into the complexities and
nuances of mobile programming. The chosen environment for these exercises was
Android, necessitating the use of the Java programming language.
Integration of AI Tools In a unique approach, the course allowed, and even
encouraged, the utilization of AI tools to aid in the problem-solving process.
Specifically, students had the option to use ChatGPT, a conversational AI developed
by OpenAI, and GitHub Copilot, an AI-based programming tool that integrates with
common programming Integrated Development Environments (IDEs).
7 https://github.com/features/copilot
D General utilization of AI: This category includes feedback that mentions the use
of AI in general for coding and learning support.
General utilization of AI usually means that students use AI tools in a broad sense. Typical feedback items were as follows: "ChatGPT's intelligence and ability to understand a wide range of questions and requests helped me figure out how to replace Copilot's features in my own way," and "AI-based assistants like ChatGPT can provide equally valuable advice and guidance when it comes to solving problems and developing programming skills." Here, AI tools are seen as a virtual assistant to which questions related to the topic can be directed. The number of feedback items in this category was three.
E Specific use cases: This category contains feedback that references distinct applications of AI, such as code troubleshooting, setting permissions, using sensors, constructing strings, and creating varied code samples; one student noted, for example, "I particularly utilized ChatGPT for assistance with writing labor-intensive yet relatively clear code." The number of feedback items in this category was four. A hypothetical sketch of this kind of routine code is given after this list.
F Successes: This category highlights the solutions and advice given by AI that proved helpful to the students. Two example comments: "ChatGPT also helped me debug my code," and "ChatGPT not only facilitated my progress and augmented my knowledge, but I also inquired not just about direct solutions but about the underlying theory behind the topics." The number of feedback items in this category was five.
G Problems and challenges: This category includes feedback in which students highlight obstacles they faced when using AI, such as incorrect code, deprecated code, broken code, and compatibility issues with new libraries. One associated remark: "I noticed that documentation was often better at resolving situations in problem cases than AI." The number of feedback items in this category was three.
H Comparisons to other resources: This feedback compares the use of AI to
other sources, such as Google, Stack Overflow8 search results, and official
documentation such as the Android Developer site.9
As mentioned in one piece of feedback, “I observed that in problematic scenarios,
documentation often outperformed AI in providing solutions.” The number of
feedback items in this category was two.
I Criticism and suggestions: In this category, students suggest that the use of
AI should be better guided and that students should be critical of the answers
provided by AI.
8 https://stackoverflow.com/
9 https://developer.android.com/
Within this category, learners advocate for more structured guidance when employing AI and emphasize the importance of being discerning about the solutions it provides. Specific feedback included: "ChatGPT isn't some magical tool as one might assume; often, I had to resort to entirely different solutions because ChatGPT couldn't grasp the situation." "ChatGPT wasn't always able to offer fully functional code examples." "While ChatGPT was generally useful, it surprisingly delivered erroneous code in its responses." The number of feedback items in this category was two.
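To give a concrete flavor of the routine Android boilerplate mentioned under category E, such as setting permissions and using sensors, a hypothetical sketch of the kind of code students reportedly delegated to AI tools is shown below. It uses standard Android SDK and AndroidX calls; the activity itself and its use of the camera permission are invented for illustration.

```java
import android.Manifest;
import android.content.pm.PackageManager;
import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;
import android.os.Bundle;
import androidx.appcompat.app.AppCompatActivity;
import androidx.core.app.ActivityCompat;
import androidx.core.content.ContextCompat;

// Hypothetical example of routine Android boilerplate (runtime permission check
// and sensor registration) of the kind students reported asking AI tools to write.
public class SensorDemoActivity extends AppCompatActivity implements SensorEventListener {

    private static final int REQUEST_CAMERA = 1;
    private SensorManager sensorManager;
    private Sensor accelerometer;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        // Request a dangerous permission at runtime if it has not been granted yet.
        if (ContextCompat.checkSelfPermission(this, Manifest.permission.CAMERA)
                != PackageManager.PERMISSION_GRANTED) {
            ActivityCompat.requestPermissions(
                    this, new String[]{Manifest.permission.CAMERA}, REQUEST_CAMERA);
        }

        // Look up the accelerometer through the system sensor service.
        sensorManager = (SensorManager) getSystemService(SENSOR_SERVICE);
        accelerometer = sensorManager.getDefaultSensor(Sensor.TYPE_ACCELEROMETER);
    }

    @Override
    protected void onResume() {
        super.onResume();
        // Listen for accelerometer updates while the activity is visible.
        sensorManager.registerListener(this, accelerometer, SensorManager.SENSOR_DELAY_NORMAL);
    }

    @Override
    protected void onPause() {
        super.onPause();
        sensorManager.unregisterListener(this);
    }

    @Override
    public void onSensorChanged(SensorEvent event) {
        float x = event.values[0]; // acceleration along the x-axis (m/s^2)
    }

    @Override
    public void onAccuracyChanged(Sensor sensor, int accuracy) {
        // Not needed for this example.
    }
}
```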
The feedback analysis shows that students predominantly use tools like ChatGPT for programming guidance. While these AI tools are valued for their assistance in programming, there is room for improvement, and for optimal results they should be used alongside traditional resources.
5. Misdirection: The AI tool occasionally led students down the wrong path, which not only impacted the immediate task but also resulted in wasted time and effort trying to rectify the issue.
6. Complexity handling: Some students noted that while ChatGPT was good for
basic queries or general guidance, it struggled with intricate problems, especially
those related to integrating libraries or specific system configurations.
7. Gap in understanding context: AI models do not possess an inherent under-
standing of context. There were situations where the tool was unable to recognize
missing methods in a class or provide solutions based on the broader project
context, which a human tutor might be able to grasp.
8. Missing guides: Basic usage guides for commonly used tools were usually
missing.
The compilation presented above was generated from the feedback obtained from students, with the primary objective of pinpointing specific problem areas. Once these issues have been identified, appropriate solutions can be explored. In the following, we discuss these problem areas in more detail.
It has to be accepted that AI tools are not magic solutions to every problem. While AI has its merits, there were times when it failed to aid students or provide the answers they were seeking. The feedback raises several overall issues: there is a need to integrate more AI-critical thinking into the education process, which would help students refine the code generated by AI, notice its mistakes, and understand its limitations. Also, many students reported bugs in ChatGPT. In addition, some believe that, at present, AI's role and effectiveness in education are rather overhyped.
AI tools, at times, offered incorrect or unusable code. As technology and
programming languages are evolving rapidly, AI’s knowledge might not always
keep pace with the most recent advancements or best practices. Some students, for
instance, pointed out that tools like ChatGPT lacked information about the latest
developments. This limitation affected the AI’s capability to deliver the best or most
pertinent solutions. Consequently, students found themselves investing extra time
in debugging or seeking alternative methods. Moreover, there were instances when
the AI suggested deprecated code. Furthermore, there were occasions where the AI
tool inadvertently led students down the wrong path. This misdirection not only
impacted the student’s immediate task but also led to unnecessary time and effort
spent trying to rectify the problem.
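As a hypothetical illustration of the deprecated-code problem described above: an assistant trained on older material might still propose Android's AsyncTask, which has been deprecated since API level 30, where a plain ExecutorService is the more current choice. The class below is an invented example, not code reported by the students.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import android.os.Handler;
import android.os.Looper;

// Hypothetical contrast: the deprecated pattern an AI assistant might still suggest,
// and a current alternative using java.util.concurrent.
public class BackgroundWork {

    // Deprecated suggestion (AsyncTask has been deprecated since Android API level 30):
    // new AsyncTask<Void, Void, String>() { ... }.execute();

    // Current alternative: run the work on an executor and post the result
    // back to the main thread with a Handler.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final Handler mainHandler = new Handler(Looper.getMainLooper());

    public void load(Runnable onDone) {
        executor.execute(() -> {
            // ... long-running work here ...
            mainHandler.post(onDone); // deliver the result on the UI thread
        });
    }
}
```

The point is not this specific API but the pattern: AI suggestions need to be checked against current platform documentation before they are adopted.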
Some students observed that while ChatGPT is effective for straightforward
questions or general advice, it has difficulties with complex issues, particularly when
it comes to integrating libraries or dealing with specific system setups. A notable
gap in understanding context was evident, as AI models do not possess an inherent
understanding of context. There were situations where the tool could not recognize
missing methods in a class or provide solutions based on the broader project context,
which a human tutor might be able to grasp. Additionally, ChatGPT was not able to
provide completely functional code examples.
The last challenge on the list does not refer only to AI tools: basic guides for frequently used tools are often absent. Many courses, regardless of the university or school, use similar tools, and it is inefficient to teach the usage of the same tools in every course. This is similar to the instruction of other common
tools, such as version control systems (e.g., git) or various IDEs. Moreover, the
teaching of these tools might not always be comprehensive. It might be expected
that students learn about these essential tools on their own. When dealing with
AI, there is an increased risk that students might misuse the tools unless proper
guidance is provided. However, dedicated research on this topic is required before
this assumption can be confirmed.
4 Discussion
In this section, we present our conclusions, namely, the best practices for tackling the challenges previously mentioned. The challenges are based on our earlier study and the material studied here, as shown in Fig. 1.
Initially, the results and observations from our earlier publication [14] were
utilized. These observations were based on feedback given by students on whether
they use AI as an aid in their studies. The survey touched lightly upon how AI tools
were used.
Additional data from the Mobile Programming course was incorporated into the source material because we wanted to determine concretely how students used AI tools. In this course, students were encouraged to use AI but, in turn, were required to document in more detail how they used the tools.
[Figure: findings (A. Interest in AI, B. Resistance to AI, C. Difficulties in using AI) and identified challenges (1. Not a magic solution, 2. Erroneous outputs, 3. Deprecated solutions) mapped to the proposed practices (I. Introduction to AI tools, II. Sharing best practices)]
I. Introduction to AI tools: Provide an introductory module covering the fundamentals and prerequisites of the AI tools used on the course.
II. Sharing best practices: Provide a guide of best practices for integrating AI into coding, starting from simple examples and increasing the difficulty progressively.
III. Critical approach: Emphasize that the code produced by AI is not always
flawless and that students should always test and check the code before
deploying it.
IV. Encourage use of other resources: Although AI is a useful tool, students
should also be encouraged to use other resources, such as official documen-
tation, when they face challenges.
V. It is only a tool: Students should remember that AI is just a tool and that the main purpose of a course is not to learn how to use a particular tool but to understand the core content of the course.
In the listing above, the numbering does not indicate the order of importance, and
the numbers are only for referencing purposes.
Introduction to AI Tools The utilization of AI tools has become paramount for enhancing productivity in both programming and educational contexts. An introductory training module should clarify the fundamental principles and functionalities of prominent AI tools, including but not limited to ChatGPT, Copilot, Bing Chat, and Bard, with emphasis on the prerequisites for working with these technologies. At the same time, students familiarize themselves with the instructions that define the tools' essential use cases. As an integral part of this curriculum, insights are provided into how AI tools can improve learning outcomes and hands-on experience.
Sharing Best Practices Construct a guide that incorporates best practices for integrating AI into coding. Students should begin using AI at the start of the course, with simple examples and a demo, such as creating a "Hello World" program in Java; as skills develop, the difficulty level should be increased progressively. Asking questions should be encouraged with examples, such as explaining the difference between Integer and int.
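A minimal sketch of such a starter demo and follow-up question might look as follows; the program and the inline explanation are purely illustrative, not course material.

```java
// Typical first AI-assisted demo: a "Hello World" program in Java.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");

        // A natural follow-up question for the AI tool: what is the difference
        // between int and Integer? In short, int is a primitive value, while
        // Integer is an object wrapper that can be null and can be stored in
        // collections; autoboxing converts between the two.
        int primitive = 42;            // primitive, never null
        Integer boxed = primitive;     // autoboxing to the wrapper type
        Integer missing = null;        // wrappers may be null; primitives may not
        System.out.println(boxed + ", " + missing);
    }
}
```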
It is essential to understand that the code generated by AI is far from perfect. Students always have to verify and test it before use, and they need to analyze the generated code in order to comprehend it and to catch the errors it may contain. Additionally, AI can sometimes produce entirely incorrect results, akin to hallucinations.
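As a minimal sketch of what this checking could look like in practice, a student might wrap an AI-generated helper in a small unit test before trusting it. The helper method and the tests below are hypothetical, and JUnit 4 is assumed to be available.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical example: a small unit test used to verify an AI-generated helper
// before it is trusted. The method under test (reverse) stands in for any
// AI-produced code.
public class ReverseStringTest {

    // Suppose this method was produced by an AI assistant.
    static String reverse(String input) {
        return new StringBuilder(input).reverse().toString();
    }

    @Test
    public void reversesOrdinaryText() {
        assertEquals("cba", reverse("abc"));
    }

    @Test
    public void handlesEmptyString() {
        assertEquals("", reverse(""));
    }
}
```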
While AI is a valuable asset, it is essential for students to diversify their sources of information, because relying solely on AI as an information source is not advisable for the reasons mentioned above. Programming language documentation can serve as an excellent alternative, and discussion forums or programming communities, like Stack Overflow, provide invaluable insights and knowledge.
Students have to recognize that AI is just one tool among many. The primary
goal of undertaking courses is not merely to become proficient with a specific tool
but to grasp the fundamental concepts of the subject.
nature of AI tools and educational methods, the replicability of the study’s findings
in different contexts or at different times might yield diverse results. Acknowledging
these threats to validity is essential for understanding the limitations of the study and the conditions under which the identified best practices for including AI in programming courses may be most effective.
While this research has laid the foundation for understanding the effective incor-
poration of AI tools in programming education, there are still several directions to
consider for future studies. One of the foremost extensions to this research would
be to implement the proposed guidelines in real-world courses. This would provide
empirical evidence on the efficacy, practicality, and potential areas of refinement for
the guidelines. Practical trials would offer insights into any unforeseen challenges
that students or educators might encounter. Also, some of the results obtained in the
current study may reflect the specific processes and practices of Tampere University.
To ascertain the universality of the findings and to comprehend possible variations
in different educational settings, it would be beneficial to gather comparative data
from other universities or educational institutions.
Furthermore, as the scope of this study was limited to the domain of program-
ming education, it would be worthwhile to explore how these guidelines could be
extrapolated to other fields. Given their generalized nature, there is a theoretical
basis for believing that similar guidelines could be effective in diverse educational
disciplines. However, empirical investigations would be necessary to determine their
actual efficacy and whether potential modifications would be required for different
subjects.
5 Summary
AI. Furthermore, one should always keep in mind that AI is—in essence—a tool,
with its own limitations and possibilities, and not a silver bullet that can solve every
problem in existence. And, as with any tool, it is important to learn how to use it
properly.
References
1. Becker, B.A., Denny, P., Finnie-Ansley, J., Luxton-Reilly, A., Prather, J., Santos, E.A.: Programming Is Hard - Or at Least It Used to Be, vol. 1, pp. 500–506. ACM, New York (2023). https://doi.org/10.1145/3545945.3569759, https://dl.acm.org/doi/10.1145/3545945.3569759
2. Carreira, G., Silva, L., Mendes, A.J., Oliveira, H.G.: Pyo, a Chatbot Assistant for Introductory Programming Students, pp. 1–6. IEEE, Piscataway (2022). https://doi.org/10.1109/SIIE56031.2022.9982349, https://ieeexplore.ieee.org/document/9982349/
3. Chan, C.K.Y.: A comprehensive AI policy education framework for university teaching and learning. Int. J. Educ. Technol. Higher Educ. 20(1), 38 (2023). https://doi.org/10.1186/s41239-023-00408-3
4. Denny, P., Leinonen, J., Prather, J., Luxton-Reilly, A., Amarouche, T., Becker, B.A., Reeves, B.N.: Promptly: using prompt problems to teach learners how to effectively utilize AI code generators (2023). http://arxiv.org/abs/2307.16364
5. Ismail, M., Ade-Ibijola, A.: Lecturer's Apprentice: A Chatbot for Assisting Novice Programmers, pp. 1–8. IEEE, Piscataway (2019). https://doi.org/10.1109/IMITEC45504.2019.9015857, https://ieeexplore.ieee.org/document/9015857/
6. Johri, A., Lindsay, E., Qadir, J.: Ethical concerns and responsible use of generative artificial intelligence in engineering education. In: Proceedings of the SEFI 2023 51st Annual Conference, 11.-14.9 2023, Dublin (2023)
7. Kiesler, N., Schiffner, D.: Large language models in introductory programming education: ChatGPT's performance and implications for assessments (2023). http://arxiv.org/abs/2308.08572
8. Luger, E., Sellen, A.: Like Having a Really Bad PA, pp. 5286–5297. ACM, New York (2016). https://doi.org/10.1145/2858036.2858288, https://dl.acm.org/doi/10.1145/2858036.2858288
9. Mhlanga, D.: Open AI in education, the responsible and ethical use of ChatGPT towards lifelong learning. SSRN Electron. J. (2023). https://doi.org/10.2139/ssrn.4354422, https://www.ssrn.com/abstract=4354422
10. Nguyen, A., Ngo, H.N., Hong, Y., Dang, B., Nguyen, B.P.T.: Ethical principles for artificial intelligence in education. Educ. Inf. Technol. 28, 4221–4241 (2023). https://doi.org/10.1007/s10639-022-11316-w, https://link.springer.com/10.1007/s10639-022-11316-w
11. Pankiewicz, M., Baker, R.S.: Large language models (GPT) for automating feedback on programming assignments (2023). http://arxiv.org/abs/2307.00150
12. Puig-Ortiz, J., Pàmies-Vilà, R., Jordi Nebot, L.: Exploring the application of ChatGPT in mechanical engineering education. In: Proceedings of the SEFI 2023 51st Annual Conference, 11.-14.9 2023, Dublin (2023)
13. Robinson, D.: How Do Students Use Stack Overflow? (2017). https://stackoverflow.blog/2017/02/15/how-do-students-use-stack-overflow/. Last accessed 29 Mar 2023
14. Saari, M., Rantanen, P., Nurminen, M., Kilamo, T., Systä, K., Abrahamsson, P.: Survey of AI tool usage in programming course: early observations. In: Agile Processes in Software Engineering and Extreme Programming – Workshops. Springer, Cham (2024)
15. Savelka, J., Agarwal, A., Bogart, C., Song, Y., Sakr, M.: Can generative pre-trained transformers (GPT) pass assessments in higher education programming courses? (2023). https://doi.org/10.1145/3587102.3588792, http://arxiv.org/abs/2303.09325