Divide-and-Conquer Meets Consensus: Unleashing The Power of Functions in Code Generation
Zhejiang University
{jcchen, zchu, zkwang, mliu, qinb}@ir.hit.edu.cn
jeffswt@outlook.com chenqianglong.ai@gmail.com
Abstract
Despite recent progress made by large language models in code generation, they
still struggle with programs that meet complex requirements. Recent work utilizes
plan-and-solve decomposition to decrease the complexity and leverages self-tests
to refine the generated program. Yet, planning for deeply nested requirements
in advance can be challenging, and the tests need to be accurate to accomplish
self-improvement. To this end, we propose FunCoder, a code generation framework
incorporating the divide-and-conquer strategy with functional consensus.
Specifically, FunCoder recursively branches off sub-functions as smaller goals
during code generation, represented by a tree hierarchy. These sub-functions are
then composited to attain more complex objectives. Additionally, we designate
functions via a consensus formed by identifying similarities in program behavior,
mitigating error propagation. FunCoder outperforms state-of-the-art methods by
+9.8% on average on HumanEval, MBPP, xCodeEval, and MATH with GPT-3.5 and
GPT-4. Moreover, our method demonstrates superiority on smaller models: with
FunCoder, StableCode3b surpasses GPT-3.5 by +18.6% and achieves 97.7% of
GPT-4's performance on HumanEval. Further analysis reveals that our proposed
dynamic function decomposition is capable of handling complex requirements, and
that functional consensus prevails over self-testing in correctness evaluation.
1 Introduction
Over the past few years, large language models have been observed to attain significant advancements
in coding capabilities (OpenAI, 2023; Touvron et al., 2023). Meanwhile, models designed specifically
for coding tasks have also been introduced (Rozière et al., 2023; Lozhkov et al., 2024; Pinnaparaju
et al., 2024). Although LLMs can proficiently generate simple code snippets, they suffer from a
decline in performance as code requirements become complicated.
Numerous efforts have been made to tackle this complexity. The two-stage methods (Jiang et al.,
2023; Zelikman et al., 2023) employ the plan-and-solve strategy, which first generates a draft outline
for the complex task and uses it as guidance for implementing the code in the second stage. Multi-
agent development frameworks (Hong et al., 2024; Qian et al., 2023) mimic real-world software
development workflows, assigning different roles to LLMs that collaborate to solve a complex goal.
Self-improvement (Shinn et al., 2023; Chen et al., 2024), on the other hand, refines the program in
accordance with execution feedback from self-generated unit tests.
Despite fruitful efforts made by the previous methods in dealing with complex problems, certain
challenges still remain unsolved: (1) Two-stage approaches need to design a complete plan at the
∗ Equal contribution. † Corresponding author.
Figure 1: A flow graph illustrating FunCoder. FunCoder branches off new functions to have
sub-goals tackled iteratively (left), then re-composites sub-functions and selects the best using functional
consensus (right). The bottom-right panel shows how FunCoder writes functions at the hierarchy level.
beginning and lack the ability to adjust the top-level design during implementation, leading to sub-
optimal decomposition. (2) Multi-agent collaboration frameworks are cumbersome and rely heavily
on LLM capabilities, making them difficult to generalize to smaller open-source models. (3) Code
refinement through self-tests depends on the correctness of generated unit tests. Our preliminary
study (§3.1.3) finds that models generate unreliable self-tests in abundance. These incorrect tests may
mislead self-improvement and, at worst, exacerbate program errors.
To address these issues, we propose FunCoder, a code generation framework utilizing a divide-and-
conquer strategy and a novel functional consensus mechanism on functions to decompose complex
problems. Starting from the main problem, FunCoder introduces new functions to cope with
certain sub-problems. These new functions are decomposed recursively, eventually forming a tree
of functions. FunCoder then combines functions bottom-up to achieve increasingly complicated
objectives. By dividing tasks into simpler sub-functions and conquering them, complexity can be gradually
reduced. However, errors in sub-functions may propagate to the whole program, thereby damaging
overall reliability. We therefore propose functional consensus, which samples multiple functions and selects the
one demonstrating consensus, measured by the aggregated similarity among candidates. By reaching
a consensus, we reduce discrepancies in code behavior and thus alleviate cascading errors.
We conduct extensive experiments on code generation benchmarks (Chen et al., 2021; Austin et al.,
2021; Khan et al., 2023) with GPT (Ouyang et al., 2022; OpenAI, 2023), outperforming state-of-
the-art methods by +9.8% on average. Experiments are further carried out on the mathematical
competition benchmark, MATH (Hendrycks et al., 2021b), achieving a +6.0 improvement with
GPT-4, indicating that FunCoder can also generalize to complex reasoning. Our method is observed
to be equally effective on open-source models (Rozière et al., 2023; Pinnaparaju et al., 2024; Meta
AI, 2024), with an average gain over baseline of +38.0% on HumanEval and +61.1% on MATH.
Additional analysis also shows the advantage of both divide-and-conquer and functional consensus.
A function is defined as a relation between a set of inputs and outputs where each input is assigned
exactly one output (Halmos, 1998), denoted as y = f (x). In computer programming, a function is
identified by its header hf with its body bf , and is commonly accompanied by a documentation df to
improve readability. Functions can be invoked from other procedures, allowing for the decomposition
of large and complicated requirements into smaller structures that exhibit high comprehensibility
and quality (Dahl et al., 1972). Generally, human programmers tend to decompose tasks into clearly
Figure 2: Left: Algorithm 1, the FunCoder procedure. Right: Comparison between decomposition by
planning and our approach. FunCoder introduces new functions to describe sub-goals solely with
code, achieving a more natural way of requirement decomposition.
defined sub-functions and then implement them recursively, making functions reusable and
taking advantage of the divide-and-conquer principle. Inspired by this, FunCoder recursively
divides the requirement and conquers functions to formulate a sophisticated solution, unleashing the
potential of LLMs in code generation.
Divide is a top-down process that iteratively breaks down problems. Given a code generation
problem, the process begins from the entry function froot. We instruct the model to introduce new
functions fi ∈ CHILD(fcur) that solve certain sub-goals while writing the current fcur. To reduce the
complexity involved in each generation, we only require the headers hfi and documentation dfi of
new functions to be generated, while their implementations bfi can be postponed. After completing
the current function, the model starts to address those unimplemented sub-functions, completing each bfi
into fi′. This process stops when the model deems a function too simple to be further divided, finally
forming a dependency tree T = TREE(froot, CHILD(froot)). The divide process resembles a search
starting from the entry function, gradually involving new sub-functions while writing the current one and
implementing them recursively; we guide the entire process through a depth-first search.
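To make the divide stage concrete, below is a minimal Python sketch of the recursive, depth-first decomposition described above. The Function container and the llm.write_with_stubs call are hypothetical stand-ins rather than the released implementation; the depth limit mirrors the setting reported in the appendix.

```python
from dataclasses import dataclass, field

@dataclass
class Function:
    header: str                         # h_f: signature identifying the function
    doc: str = ""                       # d_f: docstring describing the sub-goal
    body: str = ""                      # b_f: implementation, possibly empty at first
    children: list = field(default_factory=list)

def divide(f_cur: Function, llm, depth: int = 0, max_depth: int = 6) -> Function:
    """Depth-first decomposition: write f_cur and declare its sub-goals as stubs."""
    # The model (partially) implements f_cur; unimplemented sub-goals come back
    # only as header + docstring stubs, i.e. CHILD(f_cur).
    body, stubs = llm.write_with_stubs(f_cur)          # hypothetical LLM wrapper call
    f_cur.body = body
    if depth < max_depth:                              # stop once deemed simple enough
        for stub in stubs:
            child = Function(header=stub.header, doc=stub.doc)
            f_cur.children.append(divide(child, llm, depth + 1, max_depth))
    return f_cur                                       # root of the dependency tree T
```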
Conquer is a process of achieving complex objectives by aggregating smaller functions. We
notice that child functions are not yet implemented during the top-down process of writing their parent
functions. As a result, these parent functions may not be able to effectively utilize the child functions,
or may even misuse them. FunCoder deals with this issue by re-generating functions in inverse
topological order on the dependency tree T: starting from the leaves, complex goals are handled by
compositing solved children as f∗cur ← F(f′cur, {f∗1, f∗2, . . .}), where f∗i ∈ CHILD(fcur).
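A corresponding sketch of the conquer stage is shown below, under the same hypothetical Function container and LLM wrapper as above; the consensus selection it calls is sketched in the functional-consensus discussion that follows.

```python
def conquer(f_cur: Function, llm, k: int = 11) -> Function:
    """Bottom-up re-composition in inverse topological order (post-order traversal)."""
    solved = [conquer(child, llm, k) for child in f_cur.children]   # f*_i for CHILD(f_cur)
    # Re-generate f_cur with the solved children now visible, sampling k candidates;
    # the candidate reaching functional consensus (Eq. 2, sketched later) is kept.
    candidates = [llm.recompose(f_cur, solved) for _ in range(k)]   # hypothetical call
    f_star = fun_consensus(candidates, llm.sample_inputs(f_cur))    # hypothetical input sampling
    f_star.children = solved
    return f_star
```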
Divide and conquer naturally achieves both decomposition and composition during code generation.
Unlike two-stage and agent-based methods, our approach dynamically introduces new functions
along the way, making it less burdensome than producing a complete plan at the very beginning.
Moreover, while planning or agent methods require chat capabilities, FunCoder represents sub-tasks
through functions (Figure 2), making it more applicable to specialized code generation models.
The decomposition of complex tasks benefits from solving easier sub-goals, but it might introduce
the risk of cascading errors. To mitigate this, we introduce Functional Consensus, which aims at
reducing inconsistencies in program behavior. This is achieved by sampling multiple functions and
selecting the one that exhibits consensus, as measured by the aggregated similarity of functionality
between candidates, thus abating outlier functionalities.
Functionality Similarity A program specifies its functionality (or behavior) through the control
flow defined by its code semantics. However, comparing the functionalities between two programs
based on their semantics is somewhat challenging. By decomposing the requirement into functions,
FunCoder is able to view each function's behavior as a black box that maps arguments to return
values. Considering two functions f and g with the same input domain D(f) = D(g), we define the
similarity between them, sim(f, g), as the identicalness of their outputs when given the same input values.
\mathrm{sim}(f, g) = \int_{x \in D(f)} \frac{\mathbb{1}[f(x) = g(x)]}{|D(f)|} \;\approx\; \sum_{x \in X,\; X \sim D(f)} \frac{\mathbb{1}[f(x) = g(x)]}{|X|} \qquad (1)
The similarity becomes 1 if and only if two functions output consistent values for all inputs: ∀x ∈
D(f ) : f (x) = g(x) ⇔ sim(f, g) = 1. We notice that the input domain D(f ) is unbounded in most
cases, making its measurement barely feasible in practice. Thus, we approximate it by sampling a
subset of possible inputs X ∼ D(f ) with an LLM.
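As a rough illustration, the sampled approximation in Eq. 1 can be computed as follows, treating each candidate as a Python callable and counting identical outputs over the LLM-sampled inputs. This is a sketch under that assumption, not the exact implementation.

```python
def similarity(f, g, sampled_inputs):
    """Approximate sim(f, g): fraction of sampled inputs with identical outputs."""
    matches = 0
    for args in sampled_inputs:          # X ~ D(f), inputs proposed by an LLM
        try:
            if f(*args) == g(*args):
                matches += 1
        except Exception:                # a crash counts as a mismatch in this sketch
            pass
    return matches / len(sampled_inputs)
```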
Consensus is reached by selecting the candidate f ∗ holding maximal similarity with others after
sampling multiple function implementations F = {f(i) } for the same requirements.
f^{*} = \mathrm{FunConsensus}(F) = \arg\max_{f_{(i)} \in F} \sum_{f_{(j)} \in F \setminus \{f_{(i)}\}} \mathrm{sim}(f_{(i)}, f_{(j)}) \qquad (2)
By introducing functional consensus, FunCoder produces functions that are more consistent and
common in functionality, while omitting abnormal samples. The process is applied not just to the final
program, but also to every sub-tree during the bottom-up conquering stage, resulting in step-by-step,
thorough verification from the most fundamental functions all the way up to the whole program.
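Building on the similarity sketch above, Eq. 2 then reduces to an argmax over aggregated pairwise similarity; a minimal sketch:

```python
def fun_consensus(candidates, sampled_inputs):
    """Select f* = argmax_f sum over g != f of sim(f, g), following Eq. 2."""
    def aggregated_similarity(f):
        return sum(similarity(f, g, sampled_inputs) for g in candidates if g is not f)
    return max(candidates, key=aggregated_similarity)
```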
We design FunCoder as a procedure that takes a problem in the form of a function signature f(x)
and produces a final solution f∗(x), as exemplified in Figure 1. Given a problem f(x), FunCoder
partially implements the function as f′(x), referring to unimplemented sub-functions g(y) and h(z).
These sub-functions are then fed into FunCoder to be recursively coped with. We then sample
k implementations f′(i)(x) based on the solved children g∗(y) and h∗(z). Functional consensus is
calculated by evaluating candidates on possible inputs. The function sharing maximal behavioral
similarity is combined with the solved children to formulate the final solution.
3 Experiments
We conduct experiments on competition-level code generation and mathematical reasoning bench-
marks with state-of-the-art LLMs, covered in §3.1 and §3.2, respectively. In
addition to GPT models (Ouyang et al., 2022; OpenAI, 2023), we also conduct experiments with
community models such as Llama38b (Meta AI, 2024), StableCode3b (Pinnaparaju et al., 2024), and
CodeLlama34b (Rozière et al., 2023). We use the instruct variants of these models and run inference on a
single A100-80G GPU under BF16 precision with vLLM (Kwon et al., 2023).
3.1 Code Generation
Benchmarks We choose three benchmarks for code generation evaluation: (a) HumanEval (Chen et al., 2021)
includes entry-level coding questions; (b) MBPP (Austin et al., 2021) contains questions on standard
library invocation and programming basics; and (c) xCodeEval (Khan et al., 2023) consists of
algorithmic challenges sourced from the competitive programming platform CodeForces.
Table 1: Experiment results on code generation benchmarks. We report Pass@1 as the evaluation metric.
Results from the original papers are underlined, and the best results are in bold.
Baselines We compare FunCoder with standard prompting (Brown et al., 2020), the two-stage
decomposition method Parsel (Zelikman et al., 2023), the self-testing method CodeT (Chen et al., 2023a),
the self-improvement methods Reflexion and LDB (Shinn et al., 2023; Zhong et al., 2024), and the multi-
agent development framework MetaGPT (Hong et al., 2024). We implement standard prompting with
a 1-shot demonstration. CodeT samples 11 solutions with standard prompting and evaluates them on
model-generated tests. The results for Reflexion are reproduced from the official code.
Implementation Details FunCoder uses a 2-shot prompt in the divide stage and a 1-shot prompt for
conquering sub-functions. The number of implementations sampled for functional consensus is set
to 11 for code generation tasks. For further implementation details, please refer to Appendix A.1.
3.1.2 Results
Table 1 shows the code generation performance of advanced proprietary models, GPT-3.5 (Ouyang
et al., 2022) and GPT-4 (OpenAI, 2023). On the basic programming benchmarks, HumanEval and MBPP,
FunCoder surpasses previous SOTA methods by +3.3% in Pass@1 and reduces the error rate by 18.6%.
Furthermore, FunCoder demonstrates a substantial improvement on competition-level problems,
outperforming others by 10.4% with GPT-4 and 35.3% with GPT-3.5. We observe that FunCoder can
enhance LLMs' capability of solving more complex programming tasks, with an average accuracy
improvement of 82.3% over the baseline on the Mid and Hard subsets of xCodeEval. Expert-level
problems, however, still remain a colossal challenge for even the most cutting-edge LLMs.
Evaluation is also performed on community LLMs, Llama3 (Meta AI, 2024), StableCode (Pinna-
paraju et al., 2024), and CodeLlama (Rozière et al., 2023), with results in Tables 2 and 10. FunCoder
consistently boosts the performance of smaller models in code generation, with an average improve-
ment of +38.0% compared to standard prompting, and outperforms the previous best method CodeT
by +14.6% on HumanEval. Experimental results demonstrate that our method achieves state-of-the-art
performance on various models, ranging from basic programming to competitive contests.
3.1.3 Analysis
FunCoder Democratizes to Smaller LLMs Limited by LLM capabilities, applying self-
improvement or multi-agent methods to smaller models is far from easy. By keeping decomposition
Figure 3: (a) Preliminary study on self-testing; programs are evaluated using unit tests generated
by LLMs. (b) Effectiveness of different ranking strategies. We compute Pass@k over the top-k
programs ranked by functional consensus, self-test, and random among 11 candidates (higher is better).
and composition within the code generation process, our approach exhibits better generalization.
As shown in Tables 1 and 2, with FunCoder, Llama38b and StableCode3b achieve around 1.18× the
performance of GPT-3.5 with standard prompting, and reach about 97% of GPT-4's performance on HumanEval.
Preliminary Study on the Self-Testing Method We conduct a preliminary study targeting the self-testing
method on HumanEval; results are shown in Figure 3a, with further details in Appendix A.5. We first
verify whether model-generated programs can also pass model-generated self-tests: (a) If a program
passes self-tests, most programs from GPT-3.5 would also pass the system tests, whereas as much as 19.5%/64% ≈ 30.5%
of programs from StableCode are rejected, indicating that smaller models like StableCode may not
effectively self-test and detect program errors on their own. (b) In the event of failed self-tests, a large
portion of failures are attributed to issues in the self-tests instead of the programs, for both GPT-3.5
and StableCode. These phenomena indicate that self-testing methods have limitations in generating
correct and reliable unit tests. As a result, we design functional consensus to not require any assertions,
instead performing mutual verification between solutions, as opposed to self-testing.
Effectiveness of Functional Consensus Functional consensus and self-testing can both be viewed as
ranking algorithms for selecting functions. To measure ranking effectiveness, we conduct an analysis
on HumanEval with GPT-3.5. For each problem, 11 candidates are ranked with 3 strategies: consensus,
self-test, and random shuffle (as a baseline). Effectiveness is measured via Pass@k, i.e., whether any of the
top-k ranked programs passes the system tests. Figure 3b shows that functional consensus achieves
94.7% of the upper-bound (Pass@11) performance when selecting only a single function (Pass@1), a level
that self-test only approaches at Pass@4. This clearly demonstrates that functional consensus can effectively
evaluate correctness and pick the most promising implementation on the first attempt.
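For reference, the Pass@k used in this ranking analysis can be computed per problem as below; passes_system_tests is a hypothetical oracle over the hidden tests, and the benchmark-level score averages this indicator over all problems.

```python
def pass_at_k(ranked_programs, k, passes_system_tests):
    """1 if any of the top-k ranked programs passes the hidden system tests, else 0."""
    return int(any(passes_system_tests(p) for p in ranked_programs[:k]))
```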
Table 3: Ablation study of FunCoder on HumanEval with GPT-3.5. The setting used in our main
experiment is highlighted in bold. Tokens are calculated as the sum of prompts and completions.
Ablation and Token Usage To analyze the impact of dividing, conquering, and functional consensus
in FunCoder, we carry out an ablation study with different settings; a variant that replaces consensus
with self-testing is also included. The ablation is conducted on HumanEval with GPT-3.5, as
shown in Table 3. We observe that function decomposition and re-composition deliver cumulative
performance improvements, and functional consensus is again shown to prevail over self-testing. Putting
them all together, FunCoder obtains a +17.1 improvement with 5.09× more tokens than the baseline.
Compared to the previous SOTA LDB (≈ 23K tokens), we gain +2.5 in performance with
a 76.5% reduction in token usage.
Table 4: Experimental results on MATH, a competition-level mathematical reasoning benchmark.
Best results are in bold. Text-based reasoning methods are denoted with †, while the others use program-
aided reasoning. We report both overall results and results in seven subjects: Prealgebra, Algebra,
Number Theory, Counting & Probability, Geometry, Intermediate Algebra, and Precalculus.
3.2 Mathematical Reasoning
Code can be viewed as a tool for augmenting the reasoning capabilities of LLMs (Chen et al., 2023b).
As an alternative to text-based reasoning like Chain-of-Thought (Wei et al., 2022), programs can offer
unique advantages in terms of iteration and calculation. To test the generalizability of FunCoder
beyond algorithmic challenges, we conduct an experiment on MATH (Hendrycks et al., 2021b), a
competition-level mathematical reasoning benchmark.
3.2.2 Results
The experimental results on MATH are shown in Table 4. Program-aided reasoning
generally outperforms text-based reasoning. With GPT-4 as the backbone, FunCoder outperforms
the strongest baseline, Cumulative Reasoning (Zhang et al., 2024), by (6.0 / 8.3%) in absolute / relative
terms, and surpasses the vanilla program-aided baseline PoT (Chen et al., 2023b) by (10.0 / 14.7%). When using GPT-3.5-
turbo as the backbone, FunCoder exceeds the strongest baseline by (6.2 / 11.1%) and outperforms
PoT by as much as (13.0 / 31.7%), which indicates that our approach has a strong advantage over
both text-based reasoning and other program-aided reasoning methods.
On open-source models, FunCoder with Llama3 outperforms PoT by (12.4 / 38.0%). It even
reaches competitive performance against the state-of-the-art GPT-3.5-based method (45.0 vs.
48.6). When employing StableCode and CodeLlama as the backbone, our approach achieves
significant improvements of (12.2 / 84.7%) and (9.2 / 60.5%), respectively. This improvement
demonstrates that our approach can significantly boost smaller LLMs, democratizing the complex
reasoning capabilities of open-source LLMs through programming.
3.2.3 Analysis
FunCoder Can Handle Harder Questions Figure 4 compares CoT, PoT, and FunCoder across
varying difficulty levels. It illustrates that CoT performs comparatively well on the easiest questions
but suffers a steep decline in performance as difficulty increases, suggesting that text-based reasoning
is inadequate for tackling challenging mathematical reasoning problems. The same trend is observed
for PoT. In contrast, our method consistently demonstrates high performance even on challenging
problems, particularly excelling at level-5 difficulty with nearly double the performance of PoT and
CoT. This reflects that our method, with divide-and-conquer applied, can effectively cope with
complex problems.
Figure 4: Average accuracy at each level with the chat model (GPT-3.5) and the code model
(StableCode3b) on the MATH benchmark.
Decomposed Functions are Domain-Specific We hypothesize that questions from the same subject
require similar knowledge, which should be reflected in the functionality of the sub-functions.
To verify this hypothesis, we collect statistics on the most common sub-functions produced by FunCoder in each MATH
subject, as shown in Table 5. It is apparent that different subjects require different abilities, each
with its own set of sub-functions closely associated with the domain knowledge. In addition, these
common sub-functions are fundamentally basic and straightforward. As exemplified in Appendix B.2,
our method is able to leverage and combine these basic sub-functions to achieve more complex goals,
thereby reducing the complexity of reasoning and enhancing performance.
Table 5: Top-3 most commonly used functions in each subject of MATH, listed in descending order.
Subject Functions
Prealgebra is_prime / factorial / gcd
Algebra find_roots / is_perfect_square / find_domain
Number Theory get_divisors / mod_inverse / gcd
Counting & Probability factorial / combinations / binomial_coefficient
Geometry distance / simplify_fraction / calculate_triangle_area
Intermediate Algebra find_roots / evaluate_polynomial / lagrange_interpolation
Precalculus cross_product / fraction_from_angle / dot
4 Related Work
Large Language Models for Code Code pre-training has received widespread attention, with early
models based on small language models (SLMs) (Feng et al., 2020; Lu et al., 2021; Wang et al., 2021).
In recent years, with the development of large-scale pre-training techniques, code LLMs have emerged,
showing remarkable performance in downstream code tasks (Chen et al., 2021; Nijkamp et al., 2023;
Li et al., 2022; Rozière et al., 2023; Li et al., 2023b; Guo et al., 2024). Tasks between code and
natural language (NL) can be generally divided into three major categories: NL2Code tasks such as
code generation (Austin et al., 2021; Chen et al., 2021; Hendrycks et al., 2021a; Khan et al., 2023)
and code search (Husain et al., 2019a); Code2Code tasks including code completion (Lu et al., 2021;
Zhang et al., 2023a; Liu et al., 2024), code translation (Ahmad et al., 2023; Zhu et al., 2022; Yan
et al., 2023), and test generation (Siddiq et al., 2023; Schäfer et al., 2024); Code2NL tasks like code
summarization (Husain et al., 2019b; Jin et al., 2023). This paper focuses on code generation tasks,
ranging from basic to competition level.
Code Refinement and Self-Testing Code does not always run as expected; it may contain syntax
errors, infinite loops, or bugs, so it is essential to debug and refine the code to ensure better quality.
CodeT (Chen et al., 2023a) generates unit tests to score implementations. Self-improvement
methods (Madaan et al., 2023; Shinn et al., 2023; Chen et al., 2024; Zhong et al., 2024) design
closed-loop procedures that repeatedly refine the code based on feedback. Mirroring real-life software
development processes, multi-agent frameworks (Hong et al., 2024; Qian et al., 2023) construct
specific LLM roles, such as Tester or QA, to generate tests. These studies adopt a shared paradigm wherein
self-tests are generated by LLMs. However, Olausson et al. (2024) point out that
LLMs have certain shortcomings in self-repairing their code. This paper avoids these shortcomings
by proposing functional consensus as a reliable method of evaluation.
Program-Aided Reasoning and Agents Aside from code generation tasks, programs can serve as
tools that augment LLMs in solving complex reasoning questions or interacting with external environ-
ments. Program-of-Thought (Chen et al., 2023b) and PAL (Gao et al., 2023) prompt the model to
generate a program that solves mathematical or symbolic problems. MathPrompter (Imani et al.,
2023) and Chain-of-Code (Li et al., 2023a) fuse text-based chain-of-thought with code-based
program-of-thought prompting so that the two complement each other in mathematical reasoning. Cumulative
Reasoning (Zhang et al., 2024) conducts bottom-up reasoning to derive the final answer progres-
sively. Numerous works (Sun et al., 2023; Wang et al., 2024; Yang et al., 2024) also use code as an
intermediate component to bridge LLM agents with external environments.
Decompose for Complex Problems Several recent works employ decomposition to reduce the
complexity of hard problems. Least-to-Most (Zhou et al., 2023) adopts a two-stage approach, which
first decomposes complex problems and then solves each sub-problem individually to tackle complex
reasoning tasks. Successive Prompting (Dua et al., 2022) adopts dynamic decomposition, iteratively
breaking down problems and addressing sub-problems. Tree-of-Thought (Yao et al., 2023) breaks
down complex problems into state spaces and uses tree search to solve them. Parsel (Zelikman
et al., 2023) introduces decomposition to code generation tasks, taking a three-stage approach that breaks
requirements down into a draft and intermediate Parsel programs. RepoCoder (Zhang et al., 2023b) performs
retrieval in repositories to complete unfinished code piece by piece. Unlike these methods, FunCoder
recursively decomposes problems into a tree structure, hence gradually reducing their complexity.
5 Discussion
Limitations Our approach unleashes the potential power of functions in programming, which is
advantageous on well-defined problems such as competitive programming, or program-augmented
reasoning tasks. These scenarios do not however represent all use cases, such as open-ended problems
or casual software development. Nevertheless, we believe that the idea of divide-and-conquer and
sub-modular consensus utilized by F UN C ODER can be extended to a wider range of problems, and
we consider this as a future exploration.
Broader Impact While code generation is increasingly utilized in software development, Large
Language Models (LLMs) are still prone to generating toxic, vulnerable, or malicious code. Such
programs pose risks and should be used or executed with extra caution.
6 Conclusion
In this paper, we presented F UN C ODER, a novel code generation framework that integrates the divide-
and-conquer strategy with functional consensus to address complex requirements. F UN C ODER had
demonstrated superior performance compared to state-of-the-art methods on various benchmarks
and models. Our findings highlighted the effectiveness of dynamic decomposition and functional
consensus in writing complex code, which suggests that F UN C ODER may have the potential to
empower further improvements in code generation and other fields.
References
Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang. AVATAR: A parallel
corpus for Java-python program translation. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki
(eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 2268–2281, Toronto,
Canada, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.143. URL
https://aclanthology.org/2023.findings-acl.143.
Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen
Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language
models. ArXiv preprint, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia
Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Process-
ing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020,
December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/
1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney,
Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, Arjun Guha, Michael Greenberg,
and Abhinav Jangda. Multipl-e: A scalable and polyglot approach to benchmarking neural code generation.
IEEE Trans. Software Eng., 49(7):3675–3691, 2023. doi: 10.1109/TSE.2023.3267446. URL https:
//doi.org/10.1109/TSE.2023.3267446.
Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet:
Code generation with generated tests. In The Eleventh International Conference on Learning Representations,
ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023a. URL https://openreview.net/
forum?id=ktrw68Cmu9c.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan,
Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger,
Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder,
Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet,
Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-
Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir
Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam,
Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer,
Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba.
Evaluating large language models trained on code. ArXiv preprint, abs/2107.03374, 2021. URL https:
//arxiv.org/abs/2107.03374.
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling
computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research,
2023b. ISSN 2835-8856. URL https://openreview.net/forum?id=YfZ4ZPt8zd.
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug.
In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna Austria, May 7-11,
2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=KuPixIqPiq.
Ole-Johan Dahl, Edsger W. Dijkstra, and Charles Antony Richard Hoare. Structured programming, volume 8 of
A.P.I.C. Studies in data processing. Academic Press, 1972. ISBN 978-0-12-200550-3.
Dheeru Dua, Shivanshu Gupta, Sameer Singh, and Matt Gardner. Successive prompting for decomposing
complex questions. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022
Conference on Empirical Methods in Natural Language Processing, pp. 1251–1265, Abu Dhabi, United Arab
Emirates, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.81. URL
https://aclanthology.org/2022.emnlp-main.81.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting
Liu, Daxin Jiang, and Ming Zhou. CodeBERT: A pre-trained model for programming and natural languages.
In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics:
EMNLP 2020, pp. 1536–1547, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/
2020.findings-emnlp.139. URL https://aclanthology.org/2020.findings-emnlp.139.
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig.
PAL: program-aided language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara
Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML
2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research,
pp. 10764–10799. PMLR, 2023. URL https://proceedings.mlr.press/v202/gao23f.html.
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu,
Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model
meets programming - the rise of code intelligence. ArXiv preprint, abs/2401.14196, 2024. URL https:
//arxiv.org/abs/2401.14196.
P.R. Halmos. Naive Set Theory. Undergraduate Texts in Mathematics. Springer New York, 1998. ISBN
9780387900926. URL https://books.google.com.hk/books?id=x6cZBQ9qtgoC.
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns,
Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence
with APPS. In Joaquin Vanschoren and Sai-Kit Yeung (eds.), Proceedings of the Neural Information
Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December
2021, virtual, 2021a. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/
hash/c24cd76e1ce41366a4bbe8a49b02a028-Abstract-round2.html.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song,
and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin
Vanschoren and Sai-Kit Yeung (eds.), Proceedings of the Neural Information Processing Systems
Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, vir-
tual, 2021b. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/
be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html.
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili
Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and
Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The
Twelfth International Conference on Learning Representations, ICLR 2024, Vienna Austria, May 7-11, 2024.
OpenReview.net, 2024. URL https://openreview.net/forum?id=VtmBAGCN7o.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet
challenge: Evaluating the state of semantic code search. ArXiv preprint, abs/1909.09436, 2019a. URL
https://arxiv.org/abs/1909.09436.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet
challenge: Evaluating the state of semantic code search. ArXiv preprint, abs/1909.09436, 2019b. URL
https://arxiv.org/abs/1909.09436.
Shima Imani, Liang Du, and Harsh Shrivastava. MathPrompter: Mathematical reasoning using large language
models. In Sunayana Sitaram, Beata Beigman Klebanov, and Jason D Williams (eds.), Proceedings of the
61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pp. 37–42,
Toronto, Canada, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-industry.4.
URL https://aclanthology.org/2023.acl-industry.4.
Xue Jiang, Yihong Dong, Lecheng Wang, Qiwei Shang, and Ge Li. Self-planning code generation with large
language model. ArXiv preprint, abs/2303.06689, 2023. URL https://arxiv.org/abs/2303.06689.
Xin Jin, Jonathan Larson, Weiwei Yang, and Zhiqiang Lin. Binary code summarization: Benchmarking
chatgpt/gpt-4 and other large language models. ArXiv preprint, abs/2312.09601, 2023. URL https:
//arxiv.org/abs/2312.09601.
Mohammad Abdullah Matin Khan, M. Saiful Bari, Xuan Long Do, Weishi Wang, Md. Rizwan Parvez, and
Shafiq R. Joty. xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation,
translation and retrieval. ArXiv preprint, abs/2303.03004, 2023. URL https://arxiv.org/abs/2303.
03004.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gon-
zalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with
pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei,
Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator. ArXiv
preprint, abs/2312.04474, 2023a. URL https://arxiv.org/abs/2312.04474.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc
Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas
Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas
Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin
Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy V, Jason Stillerman, Siva Sankalp Patel, Dmitry
Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Moustafa-Fahmy, Urvashi Bhattacharyya,
Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel
Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri
Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish
Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean
Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder: may the source be
with you! ArXiv preprint, abs/2305.06161, 2023b. URL https://arxiv.org/abs/2305.06161.
Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles,
James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume,
Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Mol-
loy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu,
and Oriol Vinyals. Competition-level code generation with alphacode. ArXiv preprint, abs/2203.07814, 2022.
URL https://arxiv.org/abs/2203.07814.
Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-
completion systems. In The Twelfth International Conference on Learning Representations, ICLR 2024,
Vienna Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=
pPjZIOuQuF.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang,
Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes
Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal,
Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß,
Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru
Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex
Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian J. McAuley,
Han Hu, Torsten Scholak, Sébastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados,
and et al. Starcoder 2 and the stack v2: The next generation. ArXiv preprint, abs/2402.19173, 2024. URL
https://arxiv.org/abs/2402.19173.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement,
Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tu-
fano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie
Liu. Codexglue: A machine learning benchmark dataset for code understanding and generation. In
Joaquin Vanschoren and Sai-Kit Yeung (eds.), Proceedings of the Neural Information Processing Sys-
tems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021,
virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/
c16a5320fa475530d9583c34fd356ef5-Abstract-round1.html.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha
Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine
Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-
feedback. In Thirty-seventh Conference on Neural Information Processing Systems, NeurIPS 2023, 2023.
URL https://openreview.net/forum?id=S37hOerQLB.
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming
Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In The
Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
OpenReview.net, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.
Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Is
self-repair a silver bullet for code generation? In The Twelfth International Conference on Learning
Representations, ICLR 2024, Vienna Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://
openreview.net/forum?id=y0GJXRungR.
OpenAI. GPT-4 technical report. ArXiv preprint, abs/2303.08774, 2023. URL https://arxiv.org/abs/
2303.08774.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser
Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan
Leike, and Ryan Lowe. Training language models to follow instructions with human feed-
back. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/
b1efde53be364a73914f58805a001731-Abstract-Conference.html.
Nikhil Pinnaparaju, Reshinth Adithyan, Duy Phung, Jonathan Tow, James Baicoianu, Ashish Datta, Maksym
Zhuravinskyi, Dakota Mahan, Marco Bellagente, Carlos Riquelme, and Nathan Cooper. Stable code technical
report. ArXiv preprint, abs/2404.01226, 2024. URL https://arxiv.org/abs/2404.01226.
Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong
Sun. Communicative agents for software development. ArXiv preprint, abs/2307.07924, 2023. URL
https://arxiv.org/abs/2307.07924.
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu
Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian
Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo
Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open
foundation models for code. ArXiv preprint, abs/2308.12950, 2023. URL https://arxiv.org/abs/2308.
12950.
Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language
models for automated unit test generation. IEEE Trans. Software Eng., 50(1):85–105, 2024. doi: 10.1109/
TSE.2023.3334955. URL https://doi.org/10.1109/TSE.2023.3334955.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: language
agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing
Systems, NeurIPS 2023, 2023. URL https://openreview.net/forum?id=vAElhFcKW6.
Mohammed Latif Siddiq, Joanna C. S. Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and
Vinicius Carvalho Lopes. Exploring the effectiveness of large language models in generating unit tests. ArXiv
preprint, abs/2305.00418, 2023. URL https://arxiv.org/abs/2305.00418.
Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. Adaplanner: Adaptive planning from
feedback with language models. In Thirty-seventh Conference on Neural Information Processing Systems,
NeurIPS 2023, 2023. URL https://openreview.net/forum?id=rnKgbKmelt.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya
Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao,
Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas,
Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux,
Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov,
Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan
Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang,
Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela
Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas
Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023. URL
https://arxiv.org/abs/2307.09288.
Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code
actions elicit better LLM agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
URL https://openreview.net/forum?id=8oJyuXfrPv.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery,
and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The
Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
OpenReview.net, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw.
Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware unified pre-trained
encoder-decoder models for code understanding and generation. In Marie-Francine Moens, Xuanjing
Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, pp. 8696–8708, Online and Punta Cana, Dominican Republic,
2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.685. URL https:
//aclanthology.org/2021.emnlp-main.685.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le,
and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo,
S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information
Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022,
New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_
files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac,
Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface’s transformers: State-of-the-art
natural language processing. ArXiv preprint, abs/1910.03771, 2019. URL https://arxiv.org/abs/1910.
03771.
Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. CodeTransOcean: A comprehensive
multilingual benchmark for code translation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of
the Association for Computational Linguistics: EMNLP 2023, pp. 5067–5089, Singapore, 2023. Association
for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.337. URL https://aclanthology.
org/2023.findings-emnlp.337.
Ke Yang, Jiateng Liu, John Wu, Chaoqi Yang, Yi Fung, Sha Li, Zixuan Huang, Xu Cao, Xingyao Wang, Heng Ji,
and ChengXiang Zhai. If LLM is the wizard, then code is the wand: A survey on how code empowers large
language models to serve as intelligent agents. In ICLR 2024 Workshop on Large Language Model (LLM)
Agents, 2024. URL https://openreview.net/forum?id=8dmNOD9hbq.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan.
Tree of thoughts: Deliberate problem solving with large language models. In Thirty-seventh Conference on
Neural Information Processing Systems, NeurIPS 2023, 2023. URL https://openreview.net/forum?
id=5Xc1ecxO1h.
Eric Zelikman, Qian Huang, Gabriel Poesia, Noah Goodman, and Nick Haber. Parsel: Algorithmic reasoning
with language models by composing decompositions. In Thirty-seventh Conference on Neural Information
Processing Systems, NeurIPS 2023, 2023. URL https://openreview.net/forum?id=qd9qcbVAwQ.
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu
Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Houda
Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing, pp. 2471–2484, Singapore, 2023a. Association for Computational Linguistics.
doi: 10.18653/v1/2023.emnlp-main.151. URL https://aclanthology.org/2023.emnlp-main.151.
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu
Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Houda
Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing, pp. 2471–2484, Singapore, 2023b. Association for Computational Linguistics.
doi: 10.18653/v1/2023.emnlp-main.151. URL https://aclanthology.org/2023.emnlp-main.151.
Yifan Zhang, Jingqin Yang, Yang Yuan, and Andrew Chi-Chih Yao. Cumulative reasoning with large language
models. In ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning, 2024.
URL https://openreview.net/forum?id=XAAYyRxTlQ.
Lily Zhong, Zilong Wang, and Jingbo Shang. LDB: A large language model debugger via verifying runtime
execution step-by-step. ArXiv preprint, abs/2402.16906, 2024. URL https://arxiv.org/abs/2402.
16906.
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire
Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning
in large language models. In The Eleventh International Conference on Learning Representations, ICLR
2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?
id=WZH7099tgfM.
Ming Zhu, Karthik Suresh, and Chandan K. Reddy. Multilingual code snippets training for program translation.
In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative
Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in
Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pp. 11783–11790. AAAI Press,
2022. URL https://ojs.aaai.org/index.php/AAAI/article/view/21434.
A Appendix
In the supplementary materials, we provide the details of implementation (A.1), baseline information
and settings (A.2), benchmarks (A.3), metrics (A.4), settings in the analysis (A.5), and additional
experiments (A.6). We also demonstrate the example solutions of our method and baseline in
Appendix B, and include all the prompts in Appendix C.
Alias Description
(i) Symbols
f (x) Function In the programming language, a function consists of header,
documentation, and its body {hf , df , bf }. A function can
also be viewed as a mapping f : D(f ) → Y .
hf Function Header Declares the function name, arguments, and return type, and
is used as a signature to identify the function in a program.
df Function Docstring Provides additional usage details for this function, but is
(or Documentation) optional. We encourage the model to generate docstrings to
describe sub-goals precisely.
bf Function Body The function body contains a subroutine that describes its
(or Implementation) control flow and behavior. Functions may be invoked from
within.
f ′ (x) Partially Implemented A provisional function structure generated by the LLM
where sub-procedures are not yet implemented.
f ∗ (x) Solved Function A final implementation that is no longer changed and rep-
resents F UN C ODER’s final comprehension and solution on
the original problem.
F = {f(i)} Sampled Implementations Functions that re-implement f′(x) based on solved sub-
functions, generated by models using the same input prompt.
CHILD(f(x)) Dependency Functions that are used in f(x) (excluding f(x) itself).
T Dependency Tree Defined by TREE(f, CHILD(f)), where f is the root node
of the current sub-task. Circular references are ignored.
F Function Composition To implement a certain function f respecting sub-procedures
as potentially reusable components.
(ii) Glossary
System Test Hidden Test System testing is a phase where a set of previously invisible
test cases are run against the submitted program to validate
if the code is correct and produces the expected output for
different categories of inputs.
Unit Test Assertion A unit test is an assertion consisting of given input and
expected output, whereas in Python, it takes the form of
assert func(x) == y.
Self-testing - Self-testing is an evaluation process that prompts the model
to generate unit tests (assertions) to assess the correctness
of the generated program.
Models We access the OpenAI models GPT-3.5 (gpt-3.5-turbo-0613) and GPT-4 (gpt-4-1106-
preview) through the Azure API. Weights of community models Llama3 (Meta-Llama-3-8B-Instruct),
StableCode (stable-code-instruct-3b), and CodeLlama (CodeLlama-34b-Instruct-hf) are downloaded
from HuggingFace (Wolf et al., 2019) and are served over an OpenAI-like API on a single A100-80G
GPU under BF16 precision with vLLM (Kwon et al., 2023).
Divide We instruct the model to write the current function and introduce new functions with clearly
defined sub-goals. The prompt (C.2) for the divide process includes two examples: one that needs
to introduce new functions which are left unimplemented, and another where the sub-goal is simple
enough that no further decomposition is necessary. The model generates a Python code block with
a temperature of 0.2, and the code block is extracted to represent a tree of functions, with new
functions as children of the current one. We require that any new sub-function does not refer to existing
functions, to avoid circular references. This generation process is attempted at most 3 times
until valid code with a proper signature is found in the output. FunCoder then traverses the
function tree via depth-first search and restricts the maximum depth of the tree to 6.
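A rough sketch of this retry-and-validate loop is given below; the helpers divide_prompt, extract_python_functions, and calls_any are hypothetical stand-ins for the actual prompt construction and parsing, not the released code.

```python
def divide_once(llm, f_cur, existing_headers, max_attempts: int = 3):
    """Query the model for the current function, retrying until a usable block appears."""
    for _ in range(max_attempts):
        code = llm.complete(divide_prompt(f_cur), temperature=0.2)   # hypothetical call
        funcs = extract_python_functions(code)                        # hypothetical parser
        if any(f.header == f_cur.header for f in funcs):
            # keep only new sub-functions that do not call already-existing functions,
            # avoiding circular references in the dependency tree
            children = [f for f in funcs
                        if f.header != f_cur.header
                        and not calls_any(f, existing_headers)]
            return funcs, children
    raise RuntimeError("no valid code block with a proper signature was produced")
```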
Conquer We apply the composition of sub-functions to rewrite the parent function after all sub-
functions have been fully decomposed. The code of the sub-functions is made visible to the LLM, which
is asked to rewrite the current function given a 1-shot demonstration (C.3). With functional consensus
applied, the model samples multiple implementations at a temperature of 0.8, and the one that
reaches consensus is kept for further bottom-up processing.
Functional Consensus Functional consensus is applied in the conquer stage. Formally, Consen-
sus@k samples k - 1 implementations in the conquer stage and reuses the one produced in the divide
stage, resulting in a set F of k candidate programs. We then prompt the model with a 1-shot
demonstration (C.4) to generate potential inputs X for the given function, which are fed to each candidate
for execution. As described in Eq. 2, whenever two functions produce the same output on a given input,
each gains 1 point of similarity. A thrown exception or a timeout during execution assigns -100
points to the candidate, as it indicates potentially problematic code. Similar to self-testing methods,
we also leverage the example input/output at the root node to filter out candidates with wrong
functionality. Finally, the candidate with the maximum score over all inputs is selected, as it reaches
consensus with the other implementations.
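A simplified sketch of this scoring is given below; the sandboxed executor run(code, args) is hypothetical, and the real implementation additionally filters candidates using the example input/output at the root node.
```python
CRASH = object()  # sentinel for an exception or timeout during execution

def functional_consensus(candidates, inputs):
    """Select the candidate whose outputs agree most with the other
    candidates over the sampled inputs."""
    outputs = []
    for code in candidates:
        outs = []
        for args in inputs:
            try:
                outs.append(run(code, args))  # hypothetical sandboxed executor
            except Exception:
                outs.append(CRASH)
        outputs.append(outs)

    def score(i):
        total = 0
        for t, out in enumerate(outputs[i]):
            if out is CRASH:
                total -= 100  # penalize potentially problematic code
            else:
                # +1 for every other candidate producing the same output on this input
                total += sum(1 for j in range(len(candidates))
                             if j != i and outputs[j][t] == out)
        return total

    best = max(range(len(candidates)), key=score)
    return candidates[best]
```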
Standard Prompting performs a one-time generation and directly outputs the entire code or the final
result. In code generation tasks, we use 1-shot prompting with a temperature of 0.3. For MATH,
we sample one question-answer pair per subject from the train set, resulting in a 7-shot prompt, and run
self-consistency (Wang et al., 2023) with consistency@5 and a temperature of 0.7.
CodeT (Chen et al., 2023a) samples multiple code solutions X and unit tests Y, where a unit test is
an assertion consisting of a given input and its expected output (in Python, it takes the form of
"assert func(x) == y"). CodeT then checks the programs against the self-tests and partitions them
into sets; the score of a set is defined as the number of programs it contains multiplied by the number
of tests they pass. Finally, CodeT selects a program from the set with the most agreement (the highest-scoring set).
Similar to the setting of FunCoder, we sample 11 candidate solutions at a temperature of 0.8.
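For reference, a rough sketch of this dual-agreement selection follows; the test runner passes(p, t) is hypothetical, and CodeT's official implementation differs in details such as how tests are sampled.
```python
from collections import defaultdict

def codet_select(programs, tests):
    """CodeT-style dual agreement: group programs by the exact set of
    self-tests they pass, score each group by
    (#programs in the group) * (#tests the group passes), and return a
    program from the best group."""
    groups = defaultdict(list)
    for p in programs:
        passed = frozenset(t for t in tests if passes(p, t))  # hypothetical runner
        groups[passed].append(p)
    best_tests, best_group = max(groups.items(),
                                 key=lambda kv: len(kv[1]) * len(kv[0]))
    return best_group[0]
```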
Parsel (Zelikman et al., 2023) consists of three generation stages: high-level sketch, Parsel program,
and final program. The Parsel program is an intermediate representation of code that describes and
organizes program structure. We report the result of HumanEval with GPT-4 from the original paper.
Reflexion (Shinn et al., 2023) is a closed-loop agent system that generates unit tests and iteratively
refines the program based on self-test feedback. The results for GPT-4 on HumanEval and MBPP
are taken from the original paper. Based on the officially released code3, we evaluate GPT-3.5 and the
community models under the Reflexion strategy with max_iters=2 and Pass@1. For the xCodeEval
benchmark, which is judged through standard input/output, we wrap the standard input into a function
argument and take the return value as the output, in the form of "def main(input_str: str)
-> str"; the sample inputs/outputs are likewise transformed into visible tests for the reflexion process.
3 GitHub: noahshinn/reflexion
MetaGPT (Hong et al., 2024) employs a multi-agent strategy that assigns roles and encodes human-
like software development procedures. The scripts for reproducing its results had not been made public
by the time this paper was completed. Therefore, we include the original result for GPT-4 on the HumanEval
dataset under the with feedback setting.
LDB (Zhong et al., 2024) segments programs into basic blocks and tracks the values of intermediate
variables after each block throughout runtime execution, allowing large language models to verify
the correctness of smaller code units. We adopt the results as reported in the original paper.
Chain-of-Thought Prompting (Wei et al., 2022) generates step-by-step reasoning leading to the
final answer. The solution is formatted in LaTeX and uses \boxed{} to mark the final answer. We
sample one shot per subject from the MATH train set, resulting in a 7-shot demonstration, and run
with consistency@5 and a temperature of 0.7.
Program-of-Thought (Chen et al., 2023b) utilizes the coding ability of LLMs to generate programs
rather than text-based solutions for reasoning tasks. On MATH, we prompt the model with a 1-shot
demonstration to generate a solution() function that returns the final answer to the problem. The
program is then executed in a Python environment and its return value is taken as the answer. If an
exception is thrown during execution, the model regenerates a new program until it succeeds or
reaches 3 attempts. Similar to CoT, Program-of-Thought samples 5 programs at a temperature of 0.7 and
votes on the final result.
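A minimal sketch of this execute-and-retry loop is shown below; llm_generate is a hypothetical single-completion helper, and in practice five such answers are sampled and majority-voted.
```python
def run_pot(question: str, max_attempts: int = 3):
    """Execute a model-written solution() function and retry on failure."""
    for _ in range(max_attempts):
        code = llm_generate(question, temperature=0.7)  # hypothetical LLM call
        env = {}
        try:
            exec(code, env)            # defines solution()
            return env["solution"]()   # return value is the predicted answer
        except Exception:
            continue                   # regenerate a new program on any error
    return None
```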
Self-Refine (Madaan et al., 2023) iteratively prompts the model to give feedback on the generated code
and refine it accordingly. Self-Refine does not incorporate self-tests; refinement is conducted
solely on model feedback. In our preliminary study on HumanEval, this feedback is weak and does not
improve performance. In MATH, however, the solution program can be executed without the need
for generated assertions. We thus extend Self-Refine to capture the runtime error trace as feedback
and refine the code until it runs successfully or 3 retries are exceeded.
Cumulative Reasoning (Zhang et al., 2024) starts by decomposing the input problem into propo-
sitions and conducts bottom-up reasoning until the final answer can be concluded. The results for
Cumulative Reasoning are taken from the original paper under the with-code setting.
Table 7: Overview and details of the HumanEval, MBPP, xCodeEval, and MATH datasets.
MBPP (Austin et al., 2021) consists of fundamental Python programming problems, with a total of
974 examples covering Python programming basics, standard library usage, and related assessment.
Following Shinn et al. (2023), we adopt the mbpp-typed split from MultiPL-E (Cassano et al., 2023)
and sample 200 instances, using Pass@1 as the metric. The original prompt4 from MBPP includes all
hidden tests in the input problem, which may cause label leakage when using these tests to refine or
select programs. To ensure a fair comparison, MultiPL-E removes the test information in the prompt.
xCodeEval (Khan et al., 2023) is a competition-level multilingual and multitask benchmark covering
17 programming languages. It collects 25 million openly available samples from codeforces.com, a
platform for competitive programming. The data we use comprise problem descriptions from
problem_descriptions.jsonl and system tests from unittest_db.json, covering 7,635 competition
problems with an average of 51.1 tests per problem. Note that since the tests in xCodeEval are crawled,
some are incomplete due to the website's content limit (they end with an ellipsis and the remaining
content is missing); we filter out problems with invalid test cases. Based on the CodeForces rating
(EbTech, 2024), we categorize the problems by difficulty: Easy (≤ 1200), Mid (1200-1599), Hard
(1600-1999), and Expert (≥ 2000).
We sample 500 problems from the full split using the basic filtering rule above, resulting in Table 8.
CodeForces problems follow a different input/output style from HumanEval and MBPP: programs read
from standard input and print the answer to standard output. We therefore judge each program on the
system tests with a CodeForces-style judger and use Pass@1 (the program must pass all system tests)
as the evaluation metric.
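A simplified sketch of such a judger is given below; the actual judger may compare outputs token by token or handle special cases differently.
```python
import subprocess

TIME_LIMIT = 2.5  # seconds, see A.4

def judge(program_path: str, tests) -> bool:
    """Pass@1-style verdict: the submission must produce the expected output
    on every system test within the time limit."""
    for case in tests:
        try:
            proc = subprocess.run(
                ["python", program_path], input=case["input"],
                capture_output=True, text=True, timeout=TIME_LIMIT)
        except subprocess.TimeoutExpired:
            return False                                  # time limit exceeded
        if proc.returncode != 0:
            return False                                  # runtime error
        if proc.stdout.strip() != case["output"].strip():
            return False                                  # wrong answer
    return True
```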
Table 8: Number of test samples in (a) xCodeEval difficulty, (b) MATH level, (c) MATH subject.
A.4 Metrics
Pass@k A program is considered passed (or accepted) when it passes all system tests without errors
and within the time limit; in our experiments, we set the time limit to 2.5 seconds. Pass@k judges k
independent programs, and the result is 1 if any of them passes. In most of our experiments, we use
Pass@1 as the metric, since it reflects the accuracy a method achieves without human feedback.
Pass@k with k > 1, on the other hand, is equivalent to filtering programs through hidden,
human-annotated test labels.
EM-GPT The ground-truth labels in MATH are written in LaTeX, so the accuracy of model predictions
cannot be computed directly via exact match (EM). MATH provides a judge program5 that normalizes
LaTeX syntax and checks whether the two resulting strings are equal. However, this is insufficient for
comparing LaTeX-formatted labels against the varied outputs of programs. We follow the evaluation
criteria of previous work (Zhang et al., 2024) and use GPT-4 to assess the consistency between
predictions and ground truths, with the prompt shown in C.6.
4 Original MBPP prompt: https://github.com/google-research/google-research/tree/master/mbpp
5 math_equivalence: https://github.com/hendrycks/math/blob/main/modeling/math_equivalence.py
A.5 Details of Analysis
Details of Preliminary Analysis on Self-testing (Figure 3.a) The preliminary study is conducted
on the HumanEval dataset, which includes system tests S to evaluate the accuracy of the program, as
well as one human-annotated canonical solution c. For each question, we (1) obtain one solution
program p from Standard Prompting, and (2) prompt the model to generate 7 self-tests T based on the
question and the entry function, each in the form of a unit test assert f(x) == y. We then judge both
the generated program p and the canonical solution c against the self-tests T and the system tests S.
Formally, a pair (x, Y) denotes that program x passes test set Y: (p, S) indicates that the generated
program passes the system tests, demonstrating its correctness, while ¬(c, T) means that the canonical
solution fails the self-tests, suggesting that the model-generated tests may be wrong. The self-test
results on generated programs are first divided into two classes: self-test passed or failed. If the
self-tests pass, self-improvement methods stop iterating and pick this program as the final result; the
next step is then to determine whether the program passes the system tests. If the self-tests fail, the
error may lie in either the program or the tests themselves. In this case, the correctness of the program
is checked with the system tests, (p, S), and the correctness of the unit tests with the canonical
program, (c, T). The results on GPT-3.5 and StableCode are shown in Figure 3, and detailed
explanations of these conditions can be found in Table 9.
Details of Ranking Strategy Comparison (Figure 3.b) We obtain 11 candidate programs from
FunCoder on HumanEval with GPT-3.5 and rank them with three strategies, ensuring that the same
candidate set is used for a fair comparison. An effective ranking strategy should place correct
programs at the top and filter out erroneous ones. We therefore measure effectiveness by computing
Pass@k on the top-k programs selected by each strategy. The Pass@11 result serves as an upper
bound, as it uses all programs to compute the pass rate.
How We Count Frequently Used Functions in MATH (Table 5) In the mathematical reasoning
experiments, we used a subset of 500 items from the MATH test set, an average of 71.4 questions per
subject. However, 71.4 programs per subject are too few to reliably identify common functions.
We therefore sample 3,000 problems from the MATH test set for this experiment and run the divide-
only setting of FunCoder on them. The occurrences of sub-functions are then counted by name after
extracting the function nodes from the code trees, grouped by category.
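A sketch of this counting procedure follows, assuming the generated code of each problem is available as plain Python source; excluding the root solution function is an illustrative choice here, not a detail confirmed by the paper.
```python
import ast
from collections import Counter

def count_subfunctions(programs):
    """Count how often each sub-function name occurs across the generated
    programs of one subject (the root `solution` function is excluded)."""
    counter = Counter()
    for code in programs:
        try:
            tree = ast.parse(code)
        except SyntaxError:
            continue
        counter.update(node.name for node in ast.walk(tree)
                       if isinstance(node, ast.FunctionDef)
                       and node.name != "solution")
    return counter
```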
A.6 Supplementary Results
Open-source models on MBPP and xCodeEval We provide results for the community models
Llama3, StableCode, and CodeLlama on the remaining code generation benchmarks in Table 10. Our method
improves over the Standard baseline by an average relative margin of 12% on MBPP and 197% on
xCodeEval. Note that these smaller models have low pass rates on competition problems, which introduces
relatively high variance, so we report the median over 3 runs.
Table 10: Results for MBPP and xCodeEval with community models.
Model           Method      MBPP              xCodeEval
                            Pass@1   ∆↑       Easy    Mid    Hard   Expert   All
Llama3-8b       Standard    60.5     -        9.0     1.8    0.0    0.0      3.6
                FunCoder    62.5     +2.0     22.0    0.9    0.0    0.0      8.0
StableCode-3b   Standard    51.5     -        7.3     0.9    0.0    0.0      2.8
                FunCoder    63.5     +12.0    13.5    4.5    1.1    0.0      6.2
CodeLlama-34b   Standard    53.5     -        2.3     0.0    0.0    0.0      0.8
                FunCoder    58.5     +5.0     10.2    0.0    0.0    0.0      3.6
Token Usage of Other Methods We provide token usage results for FunCoder and the baseline
methods on the HumanEval dataset with GPT-3.5 in Table 11, reporting the average token usage per
problem. Token usage is computed as the sum of the prompt tokens and completion tokens returned
by the OpenAI chat completion API6. For LDB, we report the token usage given in the original
paper (Zhong et al., 2024).
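For clarity, the small sketch below shows how the usage of a single call can be read from the API response; the usage fields are standard in the OpenAI Python SDK, while the model name is shown only for illustration (our experiments use the Azure deployments described above).
```python
from openai import OpenAI

client = OpenAI()

def tokens_of(messages) -> int:
    """Tokens consumed by one chat-completion call; summing this over every
    call made for a problem yields the per-problem usage reported in Table 11."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages)
    return response.usage.prompt_tokens + response.usage.completion_tokens
```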
Table 11: Token usage for different settings of FunCoder and baseline methods. The LDB results
are taken from the original paper. The main settings for LDB and FunCoder are bolded.
Full Results for MATH Levels The MATH dataset divides the problems into five levels of difficulty.
The difficulty distribution of our test set can be found in Table 8. We report the average accuracy
of F UN C ODER and other methods at each level in Table 12. The results of Cumulative Reasoning
are obtained from the original paper (Zhang et al., 2024). Experiment results demonstrate that our
method consistently enhances the model’s reasoning ability across all levels of MATH.
6 https://platform.openai.com/docs/guides/text-generation/managing-tokens
Table 12: Full results of each method at different levels of MATH. The best results are in bold.
Text-based reasoning methods are denoted with † , while others use program-aided reasoning.
B Examples
We provide example solutions from the baselines and FunCoder on code generation and mathematical
reasoning. All results are generated with GPT-3.5.
## Description
You have two friends. You want to present each of them several positive integers. You want to
present cnt1 numbers to the first friend and cnt2 numbers to the second friend. Moreover, you want
all presented numbers to be distinct, that also means that no number should be presented to both
friends. In addition, the first friend does not like the numbers that are divisible without remainder
by prime number x. The second one does not like the numbers that are divisible without remainder
by prime number y. Of course, you're not going to present your friends numbers they don't like.
Your task is to find such minimum number v, that you can form presents using numbers from a set
1, 2, ..., v. Of course you may choose not to present some numbers at all. A positive integer number
greater than 1 is called prime if it has no positive divisors other than 1 and itself.
## Input
The only line contains four positive integers cnt1, cnt2, x, y (1 <= cnt1, cnt2 < 10^9;
cnt1 + cnt2 <= 10^9; 2 <= x < y <= 3*10^4) - the numbers that are described in the statement.
It is guaranteed that numbers x, y are prime.
## Output
## Sample Input 1
```
3 1 2 3
```
## Sample Output 1
```
5
```
## Sample Input 2
```
1 3 2 3
```
## Sample Output 2
```
4
```
## Notes
In the first sample you give the set of numbers {1, 3, 5} to the first friend and the set of numbers {2}
to the second friend. Note that if you give set {1, 3, 5} to the first friend, then we cannot give any of
the numbers 1, 3, 5 to the second friend. In the second sample you give the set of numbers {3} to the
first friend, and the set of numbers {1, 2, 4} to the second friend. Thus, the answer to the problem is 4.
B.1.1 Standard
def main() -> None:
    cnt1, cnt2, x, y = list(map(int, input().split()))
    # binary search
    l, r = 1, 2 * (cnt1 + cnt2)  # upper bound
    while l < r:
        mid = (l + r) // 2
        # number of numbers that can be given to the first friend
        cnt_x = mid // x
        # number of numbers that can be given to the second friend
        cnt_y = mid // y
        # number of numbers that can be given to both friends
        cnt_xy = mid // (x * y)
        # number of numbers that can't be given to any friend
        cnt_none = mid - cnt_x - cnt_y + cnt_xy
        if cnt_x >= cnt1 and cnt_y >= cnt2 and cnt_none >= cnt1 + cnt2:
            r = mid
        else:
            l = mid + 1
    print(l)
Result: Wrong answer on sample test 1: expected: ’5’, found: ’8’ [1th token] ✗
B.1.2 FunCoder
def main() -> None:
    cnt1, cnt2, x, y = map(int, input().split())
    left = 1
    right = int(2e9)
    while left < right:
        mid = (left + right) // 2
        if is_valid(mid, x, y, cnt1, cnt2):
            right = mid
        else:
            left = mid + 1
    print(left)
# check if there are cnt1 + cnt2 numbers divisible by both x and y
lcm = (x * y) // gcd(x, y)
common_numbers = v - (v // lcm)
if common_numbers < cnt1 + cnt2:
    return False
return True
while b:
    a, b = b, a % b
return a
Result: Accepted ✓
B.2.1 CoT
We can solve for $a$ by multiplying both sides of the congruence by the modular inverse of $27$
modulo $40$. Since $27$ and $40$ are relatively prime, we know that $27$ has a modular inverse
modulo $40$. We can find this inverse using the Euclidean algorithm:
\begin{align*}
\text{gcd}(40, 27) &= \text{gcd}(27, 13) \\
&= \text{gcd}(13, 1) \\
&= 1.
\end{align*}
Therefore, the modular inverse of $27$ modulo $40$ exists. We can find it using the Euclidean
algorithm backwards:
\begin{align*}
1 &= 13 - 3 \cdot 4 \\
&= 13 - 3(40 - 27) \\
&= 4 \cdot 27 - 3 \cdot 40.
\end{align*}
Thus, the modular inverse of $27$ modulo $40$ is $4$. Multiplying both sides of the congruence
$27a \equiv 17 \pmod{40}$ by $4$ gives $108a \equiv 68 \pmod{40}$. Simplifying the left-hand side
gives $8a \equiv 28 \pmod{40}$. Dividing both sides by $4$ gives $2a \equiv 7 \pmod{10}$. The
smallest positive integer $a$ satisfying this congruence is $a = 9$, and the second-smallest is
$a = 19$. Therefore, the sum of the smallest and second-smallest positive integers $a$ satisfying the
congruence $27a \equiv 17 \pmod{40}$ is $9 + 19 = \boxed{28}$.
Result: 28 ✗
B.2.2 PoT
def solution():
    r"""What is the sum of the smallest and second-smallest positive
    integers $a$ satisfying the congruence $$27a \equiv 17 \pmod{40}~?$$"""
    k = 1
    candidates = []
    while len(candidates) < 2:
        a = (40 * k + 17) // 81
        if a > 0 and (40 * k + 17) % 81 == 0:
            candidates.append(a)
        k += 1
B.2.3 FunCoder
from typing import Tuple

def solution():
    r"""What is the sum of the smallest and second-smallest positive
    integers $a$ satisfying the congruence $$27a \equiv 17 \pmod{40}~?$$"""
    a = 27
    m = 40
    b = 17
    inverse = find_inverse(a, m)
    x = (b * inverse) % m
    _, x, _ = extended_euclid(a, m)
    return x % m if x >= 0 else (x + m) % m
C Prompts
C.1 Prompt for Standard
You are a programming copilot, you can solve a problem by writing Python functions. Your task is to:
Let's think step by step and complete the following Python function `sum_factor` that solves:
``` python
def sum_factor ( a : int , b : int ) -> int :
""" Return the sum of all common prime factors of $a$ and $b$ """
raise NotImplementedError ()
```
``` python
def sum_factor ( a : int , b : int ) -> int :
""" Return the sum of all common prime factors of $a$ and $b$ """
factors_a = prime_factor ( a )
factors_b = prime_factor ( b )
common_factors = get_common ( factors_a , factors_b )
return sum ( common_factors )
```
Let ' s think step by step and complete the following Python function `{
cur_func_name } ` that solves :
``` python
{ cur_func }
```
C.2 Prompt for Divide
- For every turn, you need to write a Python function that returns the answer based on Current Code (not code in chat history).
- Do not modify function name, arg names, docstring in given functions.
- You can import libraries to better solve the problem.
- If a single function is too hard to solve, you can decompose it into multiple smaller functions.
- You can leave new functions unimplemented for now, but write the function at the end of the code and comment what the function does.
Current Code :
``` python
def sum_common_factors ( a : int , b : int ) -> int :
""" Compute the sum of all common prime factors of $a$ and $b$ """
raise NotImplementedError ()
```
Let ' s think step by step and complete the following Python function `
sum_common_factors ` that solves :
" Compute the sum of all common prime factors of $a$ and $b$ "
``` python
def sum_common_factors ( a : int , b : int ) -> int :
""" Compute the sum of all common prime factors of $a$ and $b$ """
factors_a = prime_factor ( a )
factors_b = prime_factor ( b )
common_factors = get_common ( factors_a , factors_b )
return sum ( common_factors )
```
<User>:
Current Code :
``` python
def sum_common_factors ( a : int , b : int ) -> int :
""" Compute the sum of all common prime factors of $a$ and $b$ """
factors_a = prime_factor ( a )
factors_b = prime_factor ( b )
common_factors = get_common ( factors_a , factors_b )
return sum ( common_factors )
Let ' s think step by step and complete the following Python function `
get_common ` that solves :
" get common element in two list $a$ and $b$ "
``` python
def get_common ( a : list , b : list ) -> list :
""" get common element in two list $a$ and $b$ """
ret = []
for item in a :
if item in b :
ret . append ( item )
return ret
```
Current Code :
``` python
{ prev_code }
```
Let ' s think step by step and complete the following Python function `{
cur_func_name } ` that solves :
"{ cur_func_doc }"
C.3 Prompt for Conquer
- For every turn, you need to write a Python function that returns the answer, based on current code (not code in chat history) and problem description.
- Do not modify function name, arg names, docstring in given functions.
- Consider reusing existing functions that are already implemented.
- You can import libraries to better solve the problem.
Current Code :
``` python
def prime_factor ( x : int ) -> list :
""" get a list of prime factors of number $x$ """
ret = []
i = 1
while i * i <= x :
i += 1
if x % i == 0 and is_prime ( i ) :
ret . append ( i )
return ret
raise NotImplementedError ()
```
Let ' s think step by step and implement the following method `
sum_common_factors ` using existing functions to solve :
" Return the sum of all common prime factors of $a$ and $b$ "
``` python
def sum_common_factors ( a : int , b : int ) -> int :
""" Compute the sum of all common prime factors of $a$ and $b$ """
factors_a = prime_factor ( a )
factors_b = prime_factor ( b )
common_factors = get_common ( factors_a , factors_b )
return sum ( common_factors )
```
Current Code :
``` python
{ prev_code }
```
Let ' s think step by step and implement the following method `{
cur_func_name } ` using existing functions to solve :
"{ cur_func_doc }"
C.4 Prompt for Generate Possible Input
Let ' s think step by step and create some tests for the following
function ` check_valid_brackets (...) ` in Python .
``` python
def check_valid_brackets(seq: str) -> bool:
    """Determine if a bracket sequence consisting of '(', ')', '{', '}',
    '[' and ']' is valid."""
    mapping = {')': '(', '}': '{', ']': '['}
    stack = []
    for c in seq:
        if c in mapping:
            if not stack or stack[-1] != mapping[c]:
                return False
            stack.pop()
        else:
            stack.append(c)
    return not stack
```
``` python
check_valid_brackets("()")        # True
check_valid_brackets("(([[]]))")  # True
check_valid_brackets("((())")     # False
check_valid_brackets("()[]{}")    # True
check_valid_brackets("([)]")      # False
check_valid_brackets("")          # True
check_valid_brackets(")(")        # False
```
Let ' s think step by step and create some tests for the following
function `{ cur_func_name }(...) ` in Python .
``` python
{ prev_code }
```
- You should invoke the function and assert its results in a one-liner fashion.
- Do not bring in imports other than what's already imported. Use the pre-declared imports in the original function only.
- The callee may have multiple arguments, treat them with care.
- You **must** respect the function signature and docstring, and be aware so you don't generate illegal inputs.
- Keep the inputs & outputs simple but general, and that either edge cases or common cases are meaningful.
Let ' s think step by step and create some tests for the following
function ` lcm (...) ` in Python .
```python
def lcm(a: int, b: int) -> int:
    """Find the least common multiple of `a` and `b`. Samples:
    >>> lcm(3, 5)
    15
    >>> lcm(4, 6)
    12
    """
```
Store your test cases for `lcm(...)` as assertions, one per line. They will be called later.
```python
assert lcm(15, 25) == 75
assert lcm(32, 32) == 32
assert lcm(1, 5) == 5
assert lcm(1, 1) == 1
assert lcm(17, 19) == 17 * 19
```
<User>:
``` python
{ prev_code }
```