Selection of Prompt Engineering Techniques For Code Generation Through Predicting Code Complexity
1 INTRODUCTION
Recently, Large Language Models (LLMs) have shown promising performance in various
software engineering tasks, such as unit test case generation [36–38], automated bug repair [15,
49], and API specification [24]. LLMs are especially impressive at code generation, where code is
produced directly from natural language descriptions [19, 39].
Given that the state-of-the-art LLMs are all closed-source, the most popular way to enhance an LLM's
ability to generate accurate and reliable code is to utilize various prompt engineering techniques
(PETs) [11, 40]. For example, some techniques ask LLMs to provide reasoning steps for solving
problems [3, 52, 53], while others have LLMs refine their output by prompting them to review
and improve the code they generate [6, 30]. In addition to these strategic PETs, some frameworks
leverage LLMs [58] or retrieve relevant instances from databases to automatically generate
optimal prompts for questions [31], a process known as automated prompt engineering.
Despite numerous studies focused on crafting the optimal prompt, no single technique is
optimal for every query, and selecting the right PET is not trivial. This is for two key reasons:
(1) interactive prompting techniques can be costly and do not always deliver the promised benefit,
especially when applied to simpler queries [7, 16], and (2) existing automated prompt engineering
does not utilize the multiple rounds of responses that are associated with the success of iterative
PETs [30, 56]. Moreover, existing automated prompt engineering techniques are not easily extended.
Prior work [55] has proposed a framework to select the most appropriate PET for a given query
based on feedback from LLMs. However, this approach focuses on reasoning tasks and requires
running the language model with the candidate techniques, selecting the best answer from their
outputs. This makes it less practical and quite costly.
To provide a general, low-cost solution to the PET selection task, we propose PET-Select, a PET-agnostic
selection model that does not depend on the pool of available PETs and can be easily
adapted and extended to the ever-growing list of advanced PETs. PET-Select captures
query complexity by using the complexity of the generated code as a proxy, learned through contrastive learning [22].
Specifically, by incorporating generated code complexity, PET-Select can distinguish simple from
complex queries (i.e., those requiring simple or complex code), which helps
PET-Select select the PET that targets the relevant level of difficulty. Furthermore, we
incorporate a wide range of PETs representing various categories [45], including PETs that
involve multi-round interactions with language models.
We evaluate PET-Select on two popular code generation benchmark datasets, MBPP and Hu-
manEval. To ensure a fair evaluation, we apply 5-fold cross-validation with 80% training and 20%
testing sets. Our evaluation on GPT-3.5 Turbo and GPT-4o shows that PET-Select achieves an
improvement of up to 1.9% in pass@1 accuracy compared with the best individual PET,
while using up to 74.8% fewer tokens on HumanEval with GPT-4o. Our quantitative and
qualitative results also demonstrate that PET-Select effectively selects appropriate techniques for
each code generation query. This paper makes the following contributions:
• PET-Select, a novel approach that automatically selects the most suitable prompt engi-
neering technique for each code generation query.
• An evaluation of PET-Select on two widely used benchmark datasets using two state-of-the-art
LLMs.
• Quantitative and qualitative analyses that provide insights into how PET-Select selects the
appropriate PET.
2 BACKGROUND
2.1 Automated Prompt Engineering
Since Large Language Models (LLMs) are too large to fine-tune for every downstream task, prompt
engineering has become a common approach to optimize performance across various tasks, in-
cluding unseen ones. However, designing effective prompts for each task is a challenging process.
Several studies have suggested reliable methods to improve language model performance, such
as Chain-of-Thought and Self-correction prompting. Despite this, the question remains whether
we can develop a system that automatically generates appropriate prompts for different queries.
Previous studies [9, 58] proposed frameworks for automatic instruction generation and selection,
where several candidate prompts are generated by LLMs, and the best prompt is chosen from these
candidates. Another approach involves retrieving similar queries from a database and using them
to create a more effective prompt [31]. However, these automatic prompt engineering methods
primarily focus on crafting a single optimal prompt for a given problem. There is limited research
on how to design multi-round prompting, where multiple interactions with language models are
used to refine the response. Crafting prompts based on the model’s responses is crucial, as many
state-of-the-art prompting techniques rely on self-generated answers to achieve optimal perfor-
mance. Whether used for correction or evaluation, iterative interactions with language models play
a key role in helping them generate better responses.
Technically, prompting technique selection is also a form of automated prompt engineering, as it
involves automatically choosing the most appropriate prompt. Unlike previous approaches,
prompting technique selection considers whether the prompt should be crafted for a single or mul-
tiple iterations, allowing for multiple rounds of interaction. A previous study [55] selects prompting
techniques after each execution, which is costly and impractical in real-world applications, particu-
larly when multiple techniques are considered as candidates. PET-Select is the first framework to
select prompting techniques prior to execution. It employs a traditional deep learning model with
contrastive learning to select the most suitable technique for each question, making it applicable
and affordable even without the need to run language models.
Fig. 1. The overview of PET-Select: execution records are collected on MBPP and HumanEval (Step 1), PETs are ranked per query (Step 2), query triplets are constructed (Step 3), a CodeBERT embedding model is fine-tuned with contrastive learning (Step 4), and a selection model is trained to output the selected prompting technique (Step 5).
3 APPROACH
In this work, we propose PET-Select, a novel method to select suitable prompt engineering tech-
niques (PETs) for each query. Figure 1 provides the overview of PET-Select. PET-Select is a su-
pervised learning approach and since no such record of execution is available for various prompt
engineering techniques we start off by building the data in the Dataset Construction phase (Section
3.1). PET-Select’s model consists of two main parts: the embedding layer (Section 3.2) and the
classification layer (Section 3.3). Finally, we conduct a n-fold cross-validation evaluation to ensure
that PET-Select is correctly evaluated.
Fig. 2. An example of contrastive learning in PET-Select: the anchor query (A) "Write a function to get the word with most number of occurrences in the given strings list." (complexity score 17), a positive query (P), and the negative query (N) "Write a function to find the maximum product subarray of the given array." (complexity score 58), together with their cosine distances before and after contrastive learning.
In this work, for a given query (i.e., the anchor query), we select positive queries as those with
similar generated code complexity and negative queries as those with differing code complexity
(Step 3 ). For instance, given an anchor query “Write a function to get the word with most number
of occurrences in the given strings list.” with the generated code complexity score of 17, the positive
query could be “Write a python function to remove even numbers from a given list.” with the same
code complexity score of 17, and the negative query could be “Write a function to find the maximum
product subarray of the given array.” with a much higher code complexity score of 58.
However, since some queries may not have counterparts with the same code complexity score, we instead
divide the entire training set into two categories: an easy set and a hard set. Queries with a code
complexity lower than a specified threshold are placed in the easy set, while those exceeding the
threshold are assigned to the hard set. We then randomly select a query from the same set as the
anchor query to serve as its positive query. Conversely, a query is randomly selected from the
opposite set to serve as the anchor query’s negative query. To determine the optimal threshold for
classifying the easy and hard sets, we conduct a grid search within the code complexity score range
of 25 to 45, where more than 70% of the scores are concentrated. The configuration that yields the
best result is selected as the optimal setting for the model.
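To make the triplet construction concrete, the sketch below builds (anchor, positive, negative) triplets from a training set of (query, complexity score) pairs using an easy/hard threshold, as described above. The function and variable names (e.g., build_triplets, threshold) are illustrative and not taken from the PET-Select implementation.

```python
import random

def build_triplets(training_set, threshold, seed=0):
    """Construct (anchor, positive, negative) query triplets.

    training_set: list of (query_text, complexity_score) pairs.
    threshold:    complexity score separating the easy and hard sets.
    """
    rng = random.Random(seed)
    easy = [q for q, c in training_set if c < threshold]
    hard = [q for q, c in training_set if c >= threshold]

    triplets = []
    for query, score in training_set:
        same, other = (easy, hard) if score < threshold else (hard, easy)
        # Positive: a query from the same (easy or hard) set as the anchor.
        candidates = [q for q in same if q != query]
        if not candidates or not other:
            continue  # skip anchors that cannot form a full triplet
        positive = rng.choice(candidates)
        # Negative: a query from the opposite set.
        negative = rng.choice(other)
        triplets.append((query, positive, negative))
    return triplets

# The threshold itself is chosen by a grid search over scores 25-45,
# keeping the value that yields the best downstream selection result.
```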
3.2.2 Contrastive Learning. Once the query triplets are constructed, we use them to fine-tune the
CodeBERT sentence embedding model (Step 4 ). The objective of contrastive learning is to bring
queries with similar features and complexity closer together while pushing unrelated queries with
differing complexity further apart [32]. When constructing the query triplets, we designate an input
query as the anchor, treating queries with similar code complexity scores as positive examples,
while those with dissimilar scores are used as negative examples. This design allows the model
to learn semantic representations by associating anchor queries with their positive counterparts,
positioning them closer within the embedding vector space. Conversely, we expect the model to push
unrelated queries further apart from the anchor queries. Figure 2 illustrates the examples discussed
in Section 3.2.1 to show the effect of contrastive learning in PET-Select. Before contrastive
learning, the cosine distance between the anchor sentence (blue point) and the positive sentence
(green point) is 0.013, which is greater than the distance between the anchor sentence and the
negative sentence (red point), measured at 0.01. However, after contrastive learning, the positive
sentence is brought closer to the anchor sentence in the embedding vector space, reducing the
distance to 0.01, while the negative sentence is pushed farther away, increasing the distance to 0.05.
PET-Select’s model architecture is built with the Sentence Transformer framework, specifically
leveraging CodeBERT as a Transformer-based model for sentence embedding. First, the pre-trained
CodeBERT model is used to extract embeddings for each word in sentences (in the anchor, positive,
and negative queries). These word embeddings are aggregated with a pooling layer to create a
fixed-size sentence-level embedding. The embedding model is fine-tuned by minimizing a Triplet
Loss, which is computed based on the distances between the anchor-positive and anchor-negative
query pairs:
\[ L = \max\left(0,\ \mathit{Distance}_{\mathit{anchor},\mathit{positive}} - \mathit{Distance}_{\mathit{anchor},\mathit{negative}} + \mathit{margin}\right) \]
In short, the loss function is to learn an embedding space where semantically similar sentences are
clustered together (small distance), and dissimilar sentences are far apart (large distance) [2]. The
margin is a positive value (set to 1 by default in the model) that defines a minimum gap between
the anchor-positive and anchor-negative distances. It ensures that the negative sentence is not
simply pushed just outside the positive one but is kept at a meaningful distance. The max function
ensures the loss is non-negative, meaning if the distance between the negative and the anchor is
already sufficiently large, the loss will be zero (i.e., no update is needed for this triplet).
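As a concrete illustration, the following sketch fine-tunes a CodeBERT-based sentence embedding model with a triplet loss using the sentence-transformers library. The triplet data comes from a helper like the build_triplets sketch in Section 3.2.1, and hyperparameters such as the margin, batch size, and warmup steps are illustrative rather than the exact PET-Select settings.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# CodeBERT as the Transformer backbone, followed by a pooling layer that
# aggregates word embeddings into a fixed-size sentence embedding.
word_embedding = models.Transformer("microsoft/codebert-base")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Query triplets (anchor, positive, negative) built as in Section 3.2.1.
triplets = [
    ("Write a function to get the word with most number of occurrences in the given strings list.",
     "Write a python function to remove even numbers from a given list.",
     "Write a function to find the maximum product subarray of the given array."),
]
train_examples = [InputExample(texts=[a, p, n]) for a, p, n in triplets]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Triplet loss: pull anchor-positive pairs together, push anchor-negative apart.
train_loss = losses.TripletLoss(model=model, triplet_margin=1.0)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=15, warmup_steps=100)

# After fine-tuning, queries requiring code of similar complexity should sit
# closer together in the embedding space.
embeddings = model.encode([t[0] for t in triplets])
```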
4 EXPERIMENTAL SETUP
In this section, we introduce the setup that we used to conduct our experiments. We first introduce
the prompting techniques that are included in PET-Select selection pool, we then discuss the code
complexity metrics, and finally, the experimental setting including the code generation datasets
and the evaluation metrics.
Table 1. The prompting techniques used in the experiments. The ‘Strategic Category’ column indicates the
primary strategy of each technique, chosen from one of the five categories defined in the previous study [45].
The ‘Iteration’ column specifies whether the technique requires multiple rounds of interaction with LLMs. The
‘Examples’ column shows whether examples are included in the prompt construction. Lastly, the ‘Template’
column outlines the specific prompt template used in the experiments.
We briefly go through each PET and provide some pros and cons to emphasize that no single PET is
optimal for all cases.
Root PETs: Zero-shot and Few-shot Root PETs directly query LLMs for answers. Zero-shot
and Few-shot [3] are two examples of root PETs where Zero-shot provides no additional example
and Few-shot includes several examples. While it is convenient and requires no domain-specific
input, Zero-shot performance may be limited when the model encounters unfamiliar tasks. The
added examples in Few-shot PET improve LLMs’ ability to handle unseen tasks but are not trivial
to craft [8, 28, 33] and can negatively impact the performance if given incorrectly [29, 35].
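To make the distinction concrete, the sketch below shows how a zero-shot and a few-shot prompt might be assembled for a code generation query and sent to a chat-completion API. The example problems and the exact prompt wording are illustrative, not the templates used in our experiments (those are listed in Table 1).

```python
from openai import OpenAI

client = OpenAI()
query = "Write a function to get the word with most number of occurrences in the given strings list."

# Zero-shot: the query is sent as-is, with no examples.
zero_shot_messages = [{"role": "user", "content": query}]

# Few-shot: a handful of (problem, solution) examples precede the query.
examples = [
    ("Write a python function to add two numbers.",
     "def add(a, b):\n    return a + b"),
]
few_shot_prompt = "\n\n".join(f"Problem: {p}\nSolution:\n{s}" for p, s in examples)
few_shot_messages = [
    {"role": "user", "content": f"{few_shot_prompt}\n\nProblem: {query}\nSolution:"}
]

for name, messages in [("zero-shot", zero_shot_messages), ("few-shot", few_shot_messages)]:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(name, response.choices[0].message.content[:80])
```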
Ranking Evaluation Metrics Since PET-Select ranks all the prompting techniques based on
the probability of softmax layer output, we applied two popular metrics, Mean Reciprocal Rank
(MRR) [34, 46] and Normalized Discounted Cumulative Gain (nDCG) [17], to evaluate PET-Select.
These metrics are used extensively in the domain of information retrieval and they measure the
ability of a system in recommendation tasks. Mean Reciprocal Rank measures the effectiveness
of a system in returning relevant results by focusing on the rank of the first correct answer. On
the other hand, Normalized Discounted Cumulative Gain (nDCG) measures the quality of ranked
results based on the relevance of each result and the position in which they appear in a ranking list.
With these two metrics, we can thoroughly evaluate PET-Select’s ability to recommend and rank
the appropriate PET.
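As a reference for how these two metrics are computed, the sketch below implements MRR and nDCG for a ranked list of PETs against binary relevance labels (1 if the PET produces code that passes all tests, 0 otherwise). The binary-relevance assumption and function names are ours, not a description of PET-Select's internal code.

```python
import math

def mrr(ranked_relevance_lists):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant item."""
    total = 0.0
    for relevance in ranked_relevance_lists:
        rr = 0.0
        for rank, rel in enumerate(relevance, start=1):
            if rel > 0:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_relevance_lists)

def ndcg(relevance, k=None):
    """Normalized Discounted Cumulative Gain for a single ranked list."""
    rel = relevance[:k] if k else relevance
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel))
    ideal = sorted(relevance, reverse=True)[:len(rel)]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: PET-Select ranks nine PETs for one query; the 2nd-ranked PET
# is the first one whose generated code passes all test cases.
ranking = [0, 1, 0, 1, 0, 0, 0, 0, 0]
print(mrr([ranking]), ndcg(ranking))
```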
Environmental Settings We utilize a machine with an AMD Ryzen 7 PRO 5845 8-core processor
and an NVIDIA RTX 3060 GPU to train PET-Select. To better evaluate PET-Select, we applied 5-fold cross-
validation with an 80-20 train-test split. Note that the sentence embedding model and the selection
model were only trained on the training set to prevent test data leakage into the sentence embedding
model, which could otherwise impact the performance of the selection model. We fine-tune the
sentence embedding model for fifteen epochs and select the model with the best performance
(the highest value of Cosine Accuracy) on the validation set to train the selection model. For the
selection model, we train it for 10 epochs and select the model with the best performance (the
highest value of nDCG) on the validation set to choose prompting techniques for each instance in
the test set.
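A minimal sketch of this evaluation protocol is shown below, assuming a list of queries with PET labels and a train_pet_select helper standing in for the embedding-model fine-tuning and selection-model training described in Section 3; the helper names and details are illustrative.

```python
from sklearn.model_selection import KFold

def evaluate_pet_select(queries, labels, train_pet_select, evaluate):
    """5-fold cross-validation with an 80-20 train-test split per fold."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kfold.split(queries):
        train_q = [queries[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_q = [queries[i] for i in test_idx]
        test_y = [labels[i] for i in test_idx]
        # Both the embedding model and the selection model are trained only on
        # the training split to avoid leaking test queries into the embeddings.
        model = train_pet_select(train_q, train_y)
        scores.append(evaluate(model, test_q, test_y))
    # Report the average over the five folds.
    return sum(scores) / len(scores)
```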
5 RESULT
In this section, we evaluate PET-Select and present the findings when exploring three research
questions. RQ1 explores how various PETs perform on different types of code generation with
different complexity (Section 5.1). In RQ2, we compare PET-Select performance against other
baselines on two code generation benchmarks using two versions of GPT (Section 5.2). Finally, we
analyze how PET-Select selects appropriate PETs through quantitative and qualitative analyses (Section 5.3).
5.1 RQ1. How do various PETs perform on different types of code generation with
different complexity?
In this research question, we aim to explore the relationship between the code generation types
and code complexity to inform our design decisions to incorporate query embedding and generated
code complexity in PET-Select.
5.1.1 RQ1.1 Do different PETs excel at generating code for different types of tasks? To explore the
first part of the question, we first manually categorize questions from the MBPP and HumanEval
datasets into six different types of tasks for which the generated code is responsible: Algorithm
Design, String Manipulation, List Processing, Mathematical Computation, Parsing and Formatting,
and Logical Conditions.
Specifically, we applied the following definition to perform the labeling:
• Algorithm Design involves writing code to solve problems using specific approaches or
procedures. Algorithm design includes tasks like designing search algorithms (e.g., binary
search), sorting (e.g., quicksort), and dynamic programming. The focus is on the logic and
structure required to solve problems efficiently.
• String Manipulation deals with operations related to handling text data, such as modify-
ing, concatenating, splitting, and searching within strings. Common tasks include pattern
matching (using regular expressions), converting cases (e.g., uppercase to lowercase), and
formatting strings for output.
Fig. 3. The distribution of correct instances across nine PETs on the HumanEval dataset using GPT-4o.
• List Processing involves handling collections or arrays of data. Operations include iterating
through lists, filtering, mapping, sorting, and transforming data. Tasks like merging multiple
lists or finding elements based on specific conditions also fall under this category.
• Mathematical Computation covers tasks that involve performing mathematical operations,
such as arithmetic, algebra, trigonometry, or calculus. Examples include calculating averages,
finding prime numbers, performing matrix operations, or solving equations.
• Parsing and Formatting covers interpreting structured data, such as converting a string into
a number, extracting values from JSON or XML, or reading configuration files, as well as
preparing data for output, such as formatting dates or numbers and aligning text for display.
• Logical Conditions involves decision-making in code, where you use conditions to control
the flow of the program (e.g., if-else statements, switch cases). Logical conditions help
programs execute different paths based on input or state, such as checking if a number is
even or odd, or deciding which function to call based on user input.
Figure 3 presents the distribution of correct instances across different PETs on the HumanEval
dataset using GPT-4o. For each task, some PETs are more effective than others. For example,
Progressive Hint yields the highest number of correct instances in Algorithm Design, while for
String Manipulation, the most successful technique shifts to Few-shot CoT. This finding suggests
that each technique excels at different tasks, each having its own area of expertise. As a result,
we included Category Selection as one of our baselines in RQ2 to explore whether choosing PETs
directly based on the specific code generation tasks could help identify the most suitable technique
for each question.
Fig. 4. The distribution of code complexity scores for the ground-truth code, correctly answered by each PET
across six code complexity metrics on the MBPP dataset using GPT-3.5 Turbo.
5.1.2 RQ1.2 Do different PETs excel at generating code of different complexity? Apart from task
types, we also explored if code complexity can inform the correct PET. We hypothesized that simpler
techniques might perform better on easier questions (i.e., requiring less complex code), while more
complex techniques could be more effective on harder ones (i.e., requiring more complex code). To
test this, we applied five code complexity metrics mentioned previously to the ground-truth code
for each instance in the MBPP and HumanEval datasets. To account for multiple aspects of code
complexity, we aggregate all the complexity scores into a single value called Combined Complexity,
which serves as the final complexity score for each instance.
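As an illustration of this aggregation step, the sketch below computes a handful of complexity metrics for a ground-truth Python solution with the radon library and sums their min-max-normalized values into a combined score. The specific metrics shown (cyclomatic complexity, source lines of code, inverted maintainability index) and the normalization scheme are a simplified stand-in for the five metrics and aggregation used in the paper.

```python
from radon.complexity import cc_visit
from radon.raw import analyze
from radon.metrics import mi_visit

def raw_metrics(code: str) -> dict:
    """A small subset of code complexity metrics for one solution."""
    cyclomatic = sum(block.complexity for block in cc_visit(code))
    sloc = analyze(code).sloc
    # Maintainability index is "higher is better", so invert it so that
    # larger values always mean more complex code.
    inverted_mi = 100.0 - mi_visit(code, multi=True)
    return {"cyclomatic": cyclomatic, "sloc": sloc, "inv_mi": inverted_mi}

def combined_complexity(all_solutions: list) -> list:
    """Min-max normalize each metric across the dataset, then sum per solution."""
    metrics = [raw_metrics(code) for code in all_solutions]
    combined = []
    for m in metrics:
        score = 0.0
        for name in m:
            values = [x[name] for x in metrics]
            lo, hi = min(values), max(values)
            score += (m[name] - lo) / (hi - lo) if hi > lo else 0.0
        combined.append(score)
    return combined

solutions = [
    "def add(a, b):\n    return a + b\n",
    "def count_nested(xs):\n    total = 0\n    for row in xs:\n        for x in row:\n            total += x\n    return total\n",
]
print(combined_complexity(solutions))
```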
Figure 4 demonstrates the distribution of code complexity scores for the ground-truth code,
correctly answered by each PET, across six code complexity metrics on the MBPP dataset using
GPT-3.5 Turbo. We can observe that the code complexity score of the ground truth solutions for
questions that are answered correctly in zero-shot is lower than that of most techniques across
all code complexity metrics. For example, in terms of line complexity, all PETs except Few-shot
CoT achieve higher scores than Zero-shot. This indicates that the ground-truth code for questions
correctly answered by Zero-shot tends to have fewer lines compared to those answered by other
techniques. This finding suggests that selecting PETs based on code complexity scores could be an
effective approach, which supports our proposal of incorporating code complexity into PET-Select.
Table 2. Pass@1 accuracy and token usage were evaluated on benchmark datasets using PET-Select and
twelve baselines, including nine prompting techniques and four selection baselines across different models.
Prompt engineering techniques marked with * require iterative rounds to arrive at the answer for
each instance. Acc refers to the Pass@1 accuracy, while #Token represents the
average token usage. The highest accuracy scores and the lowest token usage for each dataset and model are
highlighted in bold.
Finding 1: In RQ1, we identified two relationships that can inform our design decision of PET-
Select: task types and task complexity. We also include two baseline approaches for selecting
appropriate PETs: choosing techniques based on types of tasks or selecting them according
to code complexity scores. However, both of these baselines will require additional manual
labeling.
5.2 RQ2. How does PET-Select perform compared to individual PETs and other selection baselines?
For the 'Category Selection' row, PETs are chosen based on each question's task category: if the proportion
of questions for which Zero-shot is the most appropriate technique for Algorithm Design is 60% (i.e., among all the
questions correctly answered by the language model, Zero-shot is the most appropriate technique
for 60% of them), then Zero-shot will have a 60% chance of being selected for questions categorized
under Algorithm Design. For the ‘PET-Select W/o CL’ row, we train the selection model using
the original CodeBERT without contrastive learning which does not incorporate the complexity
measure. For the ‘PET-Select’ row, we present the results of selecting PETs based on the output of
the selection model.
On the MBPP dataset, PET-Select achieves 65.6% accuracy with GPT-3.5 Turbo, which is 0.3%
higher than the best accuracy achieved by Self-debug, a technique that applies the same method
across all instances. Furthermore, PET-Select uses approximately 13% fewer tokens while achieving
higher accuracy compared to Self-debug. This indicates that PET-Select can effectively identify in-
stances that are simple enough for language models to generate correct code using basic techniques.
A similar result is observed when running experiments with GPT-4o, where PET-Select’s accuracy
is 0.6% higher than using only Self-debug, while also utilizing fewer tokens. On the HumanEval
dataset, PET-Select achieves the same accuracy as Few-shot CoT but with 64.2% fewer tokens when
using GPT-3.5 Turbo. With GPT-4o, PET-Select achieves an accuracy of 85.4%, which is 1.9% higher
than the best accuracy of the other techniques, while also using up to 74.8% fewer tokens.
Although the Category Selection method does not achieve the highest accuracy, it remains at
least the third-best approach among all the baselines, with the lowest token usage when applied to
the HumanEval dataset. This indicates that knowing the task category partially helps in selecting
the optimal PET.
The original CodeBERT without contrastive learning (i.e., without incorporating problem complexity) does
not help the selection model consistently choose the appropriate techniques. Instead, it repeatedly
selects Zero-shot, as Zero-shot often appears to be the best technique among all the options. This result
suggests that contrastive learning effectively clusters questions of similar complexity in the em-
bedding space, and is essential in enabling the selection model to accurately choose the optimal
PET.
Complex PETs such as Self-debug, which require multiple rounds with language models, may
not always be the best choice for all questions. For instance, aside from PET-Select, while Self-
debug performs best on the MBPP dataset, it falls short on the HumanEval dataset, where simpler
techniques like Few-shot CoT achieve the highest accuracy. This result provides more examples
which support the claim that applying complex techniques to simpler questions can sometimes
result in incorrect answers. With PET-Select, we can identify instances that are simple enough to
not require complex techniques, while still generating the correct answers with fewer tokens.
Finding 2: Overall, PET-Select outperforms other baseline approaches across different versions
of GPT on both datasets, achieving comparable or up to 1.9% higher accuracy with up to
74.8% fewer tokens. Compared to other baselines, PET-Select effectively selects the appropriate
techniques based on embeddings adjusted by the contrastive learning CodeBERT model.
5.3 RQ3. How is PET-Select able to select an appropriate technique for each query?
In this section, we perform quantitative and qualitative analyses to assess PET-Select’s ability to
select the most appropriate technique for each question.
5.3.1 Quantitative Analysis. As mentioned in Experimental Setup, we utilize two metrics, Mean
Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (nDCG), to evaluate PET-
Select’s recommendation ability. In Table 3 we present various selection methods’ effectiveness
Table 3. The ranking effectiveness of selection methods measured with MRR and nDCG metrics
measured by MRR and nDCG. Since we applied 5-fold cross-validation the MRR and nDCG values
are the average results from the test set across five folds.
Without contrastive learning, PET-Select W/o CL achieves a high MRR value across all experi-
ments. This occurs because the selection model consistently chooses Zero-shot as the appropriate
technique. As a result, PET-Select W/o CL tends to perform well in MRR, since Zero-shot is often
the most suitable technique for questions it answers correctly. However, a higher MRR score does
not necessarily indicate that the best technique is selected for every instance. It simply means that
for the instances where the selected technique provides a correct answer, the chosen method is
likely one of the top-performing options. This is further demonstrated in Table 2, where PET-Select
without contrastive learning does not achieve the highest accuracy but often uses fewer tokens
than other techniques.
On the other hand, PET-Select consistently achieves the highest performance with respect to
nDCG metric. This indicates that it can reliably select techniques that lead to correct answers.
Although PET-Select falls short on the MRR metric, meaning it doesn’t always choose the most
appropriate technique for every instance, the selected PET still generates the correct code that
passes all test cases. This is evidenced in Table 2, where PET-Select outperforms other approaches
in terms of accuracy across all experiments. This result indicates that PET-Select is effective in
selecting the correct technique that is capable of generating the correct code.
5.3.2 Qualitative Analysis. This section aims to provide some additional support for the experi-
mental results by analyzing the queries that were only answered correctly by Zero-shot (our most
simple PET) and successfully selected by PET-Select. Conversely, we also examined the queries that
were only answered correctly by Self-debug (our most complex PET) and were likewise successfully
selected by PET-Select. The purpose of these analyses is to provide additional examples that explain
the reason why PET-Select is successful in selecting the correct PET in the previous experiments.
Table 4 lists some example instances in the MBPP dataset. For instance, questions containing
the term 'nested' (numbers 1-3 in Table 4) will likely require complex code, as they likely involve
nested, iterative loops. Complex PETs such as Self-debug are more likely to generate the correct answer,
while basic techniques such as Zero-shot tend to answer incorrectly. PET-Select successfully selects
the appropriate technique between Zero-shot and Self-debug, indicating that it learns to recognize
such keywords in the queries. By placing sentences containing the word ‘nested’ closer together in
the embedding space, PET-Select is able to classify them and select the correct PETs.
In contrast, sentences that do not contain specific keywords are pushed further away from
those that do. As a result, PET-Select will select relatively basic techniques for those questions.
For example, queries 4-5 in Table 4 are also List Processing tasks, but they only require a single loop to solve.
Table 4. Selection results of PET-Select for example instances. ✓ indicates that the technique answers the
question correctly, while ✗ indicates that it answers incorrectly.
In this case, Zero-shot is a more appropriate PET, while Self-debug
is too complex and sub-optimal. Since those questions do not contain the specific keywords that
indicate complex problems (e.g., nested), PET-Select selects Zero-shot instead of Self-debug as the
appropriate technique. The above examples demonstrate that PET-Select can effectively select the
appropriate technique based on code complexity predictions derived from keywords in the queries
with the help of contrastive learning. By selecting simpler PETs when appropriate, PET-Select
not only performs well in all cases but also reduces the overall number of tokens required when
compared to complex state-of-the-art PETs such as Self-debug.
Finding 3: Through quantitative analysis, we found that while PET-Select does not always
select the most efficient technique in terms of token usage, it still manages to provide correct
answers by choosing techniques that are capable of generating the correct code. Additionally,
qualitative analysis revealed that PET-Select’s improvement over the best individual PET can
be explained by its ability to select simpler PETs when appropriate, which reduces token usage
while maintaining a high generated code passing rate.
6 RELATED WORK
6.1 Code Complexity Prediction
Code complexity prediction has emerged as a key area of focus in recent research, with various
approaches leveraging machine learning and deep learning techniques. A notable advancement is
the application of deep learning models, such as hierarchical Transformers, which process method-
level code snippets and aggregate them into class-level embeddings [18]. These models excel
in handling longer code sequences, surpassing previous methods through advanced multi-level
pre-training objectives that enhance the model’s understanding of complexity-related features.
Additionally, studies have explored the effectiveness of GPT-3-based models like GitHub Copilot,
highlighting both their strengths and limitations in zero-shot complexity prediction [43]. While
Copilot performs well with linear complexities, specialized deep learning models demonstrate
superior overall accuracy.
In contrast, PET-Select does not aim at code complexity prediction from natural language queries
as an end in itself. Instead, we employ complexity prediction as an intermediate step to determine
the appropriate prompting technique for answering each natural language question.
7 THREATS TO VALIDITY
7.1 Internal validity
Our code has been thoroughly reviewed to ensure the implementation is correct, and we have
confirmed that the questions in the testing dataset are not present in the question base. We also
carefully craft our prompts for each prompting technique, adhering closely to the guidelines
outlined in the original paper for each method. However, the way prompts and examples are crafted
may influence the performance of each technique, which in turn can affect the results of PET-Select.
8 CONCLUSION
In this paper, we introduced PET-Select, a novel system designed to automatically select appro-
priate prompt engineering techniques (PETs) for code generation tasks based on code complexity
predictions. By leveraging contrastive learning and a CodeBERT-based sentence embedding model,
PET-Select effectively identifies simpler questions and applies suitable techniques, achieving com-
parable or higher accuracy with fewer tokens. Our evaluation on the MBPP and HumanEval datasets
demonstrates that PET-Select not only enhances performance but also reduces computational costs.
Future work will focus on refining the model and exploring its application to other domains.
9 DATA AVAILABILITY
We release our code and data through the following link:
https://anonymous.4open.science/r/Prompt-Selection-B47F.
REFERENCES
[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie
Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732
(2021).
[2] Hervé Bredin. 2017. Tristounet: triplet loss for speaker turn embedding. In 2017 IEEE international conference on
acoustics, speech and signal processing (ICASSP). IEEE, 5430–5434.
[3] Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
[4] G Ann Campbell. 2018. Cognitive Complexity-A new way of measuring understandability. SonarSource SA (2018), 10.
[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv
preprint arXiv:2107.03374 (2021).
[6] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug.
arXiv preprint arXiv:2304.05128 (2023).
[7] Cheng-Han Chiang and Hung-Yi Lee. 2024. Over-Reasoning and Redundant Calculation of Large Language Models. In
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2:
Short Papers). 161–169.
[8] Hai Dang, Lukas Mecke, Florian Lehmann, Sven Goller, and Daniel Buschek. 2022. How to Prompt? Opportunities and
Challenges of Zero- and Few-Shot Learning for Human-AI Interaction in Creative Applications of Generative Models.
arXiv:2209.01390 [cs.HC]
[9] Viet-Tung Do, Van-Khanh Hoang, Duy-Hung Nguyen, Shahab Sabahi, Jeff Yang, Hajime Hotta, Minh-Tien Nguyen,
and Hung Le. 2024. Automatic Prompt Selection for Large Language Models. arXiv preprint arXiv:2404.02717 (2024).
[10] Christof Ebert, James Cain, Giuliano Antoniol, Steve Counsell, and Phillip Laplante. 2016. Cyclomatic complexity.
IEEE software 33, 6 (2016), 27–29.
[11] Sidong Feng and Chunyang Chen. 2024. Prompting is all you need: Automated android bug replay with large language
models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13.
[12] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu,
Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint
arXiv:2002.08155 (2020).
[13] T Hariprasad, G Vidhyagaran, K Seenu, and Chandrasegar Thirumalai. 2017. Software complexity analysis using
halstead metrics. In 2017 International Conference on Trends in Electronics and Informatics (ICEI). IEEE, 1109–1113.
[14] Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. In Similarity-based pattern recognition:
third international workshop, SIMBAD 2015, Copenhagen, Denmark, October 12-14, 2015. Proceedings 3. Springer, 84–92.
[15] Soneya Binta Hossain, Nan Jiang, Qiang Zhou, Xiaopeng Li, Wen-Hao Chiang, Yingjun Lyu, Hoan Nguyen, and Omer
Tripp. 2024. A deep dive into large language models for automated bug localization and repair. Proceedings of the ACM
on Software Engineering 1, FSE (2024), 1471–1493.
[16] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou.
2023. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798 (2023).
[17] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on
Information Systems (TOIS) 20, 4 (2002), 422–446.
[18] Mingi Jeon, Seung-yeop Baik, Joonghyuk Hahn, Yo-Sub Han, and Sang-Ki Ko. 2023. Deep learning-based source code
complexity prediction. (2023).
[19] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for
Code Generation. arXiv preprint arXiv:2406.00515 (2024).
[20] Xue Jiang, Yihong Dong, Lecheng Wang, Fang Zheng, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. 2023. Self-planning
Code Generation with Large Language Models. ACM Transactions on Software Engineering and Methodology (2023).
[21] Yue Jiang, Bojan Cuki, Tim Menzies, and Nick Bartlow. 2008. Comparing design and code metrics for software quality
prediction. In Proceedings of the 4th international workshop on Predictor models in software engineering. 11–18.
[22] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and
Dilip Krishnan. 2020. Supervised contrastive learning. Advances in neural information processing systems 33 (2020),
18661–18673.
[23] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022.
Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406 (2022).
[24] Myeongsoo Kim, Tyler Stennett, Dhruv Shah, Saurabh Sinha, and Alessandro Orso. 2024. Leveraging large language
models to improve REST API testing. In Proceedings of the 2024 ACM/IEEE 44th International Conference on Software
Engineering: New Ideas and Emerging Results. 37–41.
[25] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models
are zero-shot reasoners. Advances in neural information processing systems 35 (2022), 22199–22213.
[26] Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv
preprint arXiv:2104.08691 (2021).
[27] Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint
arXiv:2101.00190 (2021).
[28] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What Makes Good
In-Context Examples for GPT-3? arXiv preprint arXiv:2101.06804 (2021).
[29] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and
where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786 (2021).
[30] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri,
Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural
Information Processing Systems 36 (2024).
[31] Noor Nashid, Mifta Sintaha, and Ali Mesbah. 2023. Retrieval-based prompt selection for code-related few-shot learning.
In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2450–2462.
[32] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding.
arXiv preprint arXiv:1807.03748 (2018).
[33] Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. Advances in
neural information processing systems 34 (2021), 11054–11070.
[34] Dragomir R Radev, Hong Qi, Harris Wu, and Weiguo Fan. 2002. Evaluating web-based question answering systems. In
LREC. Citeseer.
[35] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2021. Learning to retrieve prompts for in-context learning. arXiv
preprint arXiv:2112.08633 (2021).
[36] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models
for automated unit test generation. IEEE Transactions on Software Engineering (2023).
[37] Jiho Shin, Sepehr Hashtroudi, Hadi Hemmati, and Song Wang. 2024. Domain Adaptation for Code Model-Based
Unit Test Case Generation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing
and Analysis (Vienna, Austria) (ISSTA 2024). Association for Computing Machinery, New York, NY, USA, 1211–1222.
https://doi.org/10.1145/3650212.3680354
[38] Jiho Shin, Hadi Hemmati, Moshi Wei, and Song Wang. 2024. Assessing evaluation metrics for neural test oracle
generation. IEEE Transactions on Software Engineering (2024).
[39] Jiho Shin and Jaechang Nam. 2021. A survey of automatic code generation from natural language. Journal of Information
Processing Systems 17, 3 (2021), 537–555.
[40] Jiho Shin, Clark Tang, Tahmineh Mohati, Maleknaz Nayebi, Song Wang, and Hadi Hemmati. 2023. Prompt engineering
or fine tuning: An empirical assessment of large language models in automated software engineering tasks. arXiv
preprint arXiv:2310.10508 (2023).
[41] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. Autoprompt: Eliciting
knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980 (2020).
[42] Yonghee Shin and Laurie Williams. 2008. An empirical model to predict security vulnerabilities using code complex-
ity metrics. In Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and
measurement. 315–317.
[43] Mohammed Latif Siddiq, Abdus Samee, Sk Ruhul Azgor, Md Asif Haider, Shehabul Islam Sawraz, and Joanna CS Santos.
2023. Zero-shot prompting for code complexity prediction using github copilot. In 2023 IEEE/ACM 2nd International
Workshop on Natural Language-Based Software Engineering (NLBSE). IEEE, 56–59.
[44] Hao Sun. 2023. Offline prompt evaluation and optimization with inverse reinforcement learning. arXiv preprint
arXiv:2309.06553 (2023).
[45] Catherine Tony, Nicolás E Díaz Ferreyra, Markus Mutas, Salem Dhiff, and Riccardo Scandariato. 2024. Prompting
Techniques for Secure Code Generation: A Systematic Investigation. arXiv preprint arXiv:2407.07064 (2024).
[46] Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In TREC, Vol. 99. 77–82.
[47] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.
Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing
systems 35 (2022), 24824–24837.
[48] Moshi Wei, Nima Shiri Harzevili, Yuchao Huang, Junjie Wang, and Song Wang. 2022. Clear: contrastive learning for
api recommendation. In Proceedings of the 44th International Conference on Software Engineering. 376–387.
[49] Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. 2023. Copiloting the copilots: Fusing large language models
with completion engines for automated program repair. In Proceedings of the 31st ACM Joint European Software
Engineering Conference and Symposium on the Foundations of Software Engineering. 172–184.
[50] Kurt D Welker. 2001. The software maintainability index revisited. CrossTalk 14 (2001), 18–21.
[51] Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-
Smith, and Douglas C Schmidt. 2023. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv
preprint arXiv:2302.11382 (2023).
[52] Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E Gonzalez, and Bin Cui. 2024.
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models. arXiv preprint arXiv:2406.04271
(2024).
[53] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of
thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems
36 (2024).
[54] Hongyu Zhang, Xiuzhen Zhang, and Ming Gu. 2007. Predicting defective software components from code complexity
measures. In 13th Pacific Rim International Symposium on Dependable Computing (PRDC 2007). IEEE, 93–96.
[55] James Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, and Michael Qizhe Xie. 2023. Automatic model selection with
large language models for reasoning. arXiv preprint arXiv:2305.14333 (2023).
[56] Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves
reasoning in large language models. arXiv preprint arXiv:2304.09797 (2023).
[57] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier
Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv
preprint arXiv:2205.10625 (2022).
[58] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022.
Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022).