Divide-and-Conquer Meets Consensus: Unleashing The Power of Functions in Code Generation
Zhejiang University
{jcchen, zchu, zkwang, mliu, qinb}@ir.hit.edu.cn
jeffswt@outlook.com chenqianglong.ai@gmail.com
Abstract
Despite recent progress made by large language models in code generation, they
still struggle with programs that meet complex requirements. Recent work utilizes
plan-and-solve decomposition to decrease the complexity and leverages self-tests
to refine the generated program. Yet, planning for deeply nested requirements
in advance can be challenging, and the tests need to be accurate to accomplish
self-improvement. To this end, we propose FunCoder, a code generation framework
incorporating the divide-and-conquer strategy with functional consensus.
Specifically, FunCoder recursively branches off sub-functions as smaller goals
during code generation, represented by a tree hierarchy. These sub-functions are
then composited to attain more complex objectives. Additionally, we designate
functions via a consensus formed by identifying similarities in program behavior,
mitigating error propagation. FunCoder outperforms state-of-the-art methods by
+9.8% on average on HumanEval, MBPP, xCodeEval, and MATH with GPT-3.5 and
GPT-4. Moreover, our method demonstrates superiority on smaller models: with
FunCoder, StableCode3b surpasses GPT-3.5 by +18.6% and achieves 97.7% of
GPT-4's performance on HumanEval. Further analysis reveals that our proposed
dynamic function decomposition is capable of handling complex requirements, and
that functional consensus prevails over self-testing in correctness evaluation.
1 Introduction
Over the past few years, large language models have been observed to attain significant advancements
in coding capabilities (OpenAI, 2023; Touvron et al., 2023). Meanwhile, models designed specifically
for coding tasks have also been introduced (Rozière et al., 2023; Lozhkov et al., 2024; Pinnaparaju
et al., 2024). Although LLMs can proficiently generate simple code snippets, they suffer from a
decline in performance as code requirements become complicated.
Numerous efforts have been made to tackle this complexity. The two-stage methods (Jiang et al.,
2023; Zelikman et al., 2023) employ the plan-and-solve strategy, which first generates a draft outline
for the complex task and uses it as guidance for implementing the code in the second stage. Multi-
agent development frameworks (Hong et al., 2024; Qian et al., 2023) mimic real-world software
development workflows, assigning different roles to LLMs that collaborate to solve a complex goal.
Self-improvement (Shinn et al., 2023; Chen et al., 2024), on the other hand, refines the program in
accordance with execution feedback from self-generated unit tests.
Despite fruitful efforts made by the previous methods in dealing with complex problems, certain
challenges still remain unsolved: (1) Two-stage approaches need to design a complete plan at the
∗ Equal contribution. † Corresponding author.
Figure 1: A flow graph illustrating FunCoder. FunCoder branches off new functions to have
sub-goals tackled iteratively (left), then re-composites sub-functions and selects the best using functional
consensus (right). The bottom-right panel shows how FunCoder writes functions at the hierarchy level.
beginning and lack the ability to adjust the top-level design during implementation, leading to sub-
optimal decomposition. (2) Multi-agent collaboration frameworks are cumbersome and rely heavily
on LLM capabilities, making them difficult to generalize to smaller open-source models. (3) Code
refinement through self-tests depends on the correctness of generated unit tests. Our preliminary
study (§3.1.3) finds that models generate unreliable self-tests in abundance. These incorrect tests may
mislead self-improvement and, at worst, exacerbate program errors.
To address these issues, we propose FunCoder, a code generation framework utilizing a divide-and-
conquer strategy and a novel functional consensus mechanism on functions to decompose complex
problems. Starting from the main problem, FunCoder introduces new functions to cope with
certain sub-problems. These new functions are decomposed recursively, eventually forming a tree
of functions. FunCoder then combines functions bottom-up to achieve increasingly complicated
objectives. By dividing tasks into simpler sub-functions and conquering them, complexity can be gradually
reduced. However, errors in sub-functions may propagate to the whole program, thereby damaging
overall reliability. We therefore propose functional consensus, which samples multiple functions and selects the
one demonstrating consensus, measured by the aggregated similarity among candidates. By reaching
a consensus, we reduce discrepancies in code behavior and thus alleviate cascading errors.
We conduct extensive experiments on code generation benchmarks (Chen et al., 2021; Austin et al.,
2021; Khan et al., 2023) with GPT (Ouyang et al., 2022; OpenAI, 2023), outperforming state-of-
the-art methods by +9.8% on average. Experiments are further carried out on the mathematical
competition benchmark, MATH (Hendrycks et al., 2021b), achieving a +6.0 improvement with
GPT-4, indicating that FunCoder can also generalize to complex reasoning. Our method is observed
to be equally effective on open-source models (Rozière et al., 2023; Pinnaparaju et al., 2024; Meta
AI, 2024), with an average gain over baseline of +38.0% on HumanEval and +61.1% on MATH.
Additional analysis also shows the advantage of both divide-and-conquer and functional consensus.
A function is defined as a relation between a set of inputs and outputs where each input is assigned
exactly one output (Halmos, 1998), denoted as y = f (x). In computer programming, a function is
identified by its header hf with its body bf , and is commonly accompanied by a documentation df to
improve readability. Functions can be invoked from other procedures, allowing for the decomposition
of large and complicated requirements into smaller structures that exhibit high comprehensibility
and quality (Dahl et al., 1972). Generally, human programmers tend to decompose tasks into clearly
Figure 2: Left: Algorithm 1, the FunCoder procedure. Right: Comparison between decomposition by
planning and our approach. FunCoder introduces new functions to describe sub-goals solely with
code, achieving a more natural way of requirement decomposition.
defined sub-functions and then implement them recursively, making functions reusable and
taking advantage of the divide-and-conquer principle. Inspired by this, FunCoder recursively
divides the requirement and conquers functions to formulate a sophisticated solution, unleashing the
potential of LLMs in code generation.
Divide is a top-down process that iteratively breaks down problems. Given a code generation
problem, the process begins from the entry function froot. We instruct the model to introduce new
functions fi ∈ CHILD(fcur) that solve certain sub-goals while writing the current fcur. To reduce the
complexity involved in each generation, we only require the headers hfi and documentation dfi of
new functions to be generated, while their implementations bfi can be postponed. After completing
the current function, the model starts to address those unimplemented sub-functions, completing each bfi
into fi′. This process stops when the model deems a function too simple to be further divided, finally
forming a dependency tree T = TREE(froot, CHILD(froot)). The divide process resembles a search
starting from the entry function, gradually involving new sub-functions while writing the current one and
implementing them recursively; we guide the entire process through a depth-first search.
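To make the divide stage concrete, below is a minimal Python sketch of the recursive, depth-first decomposition described above. The Function container and the llm.write_with_stubs call are hypothetical stand-ins rather than the released implementation; the depth limit mirrors the setting reported in the appendix.

```python
from dataclasses import dataclass, field

@dataclass
class Function:
    header: str                         # h_f: signature identifying the function
    doc: str = ""                       # d_f: docstring describing the sub-goal
    body: str = ""                      # b_f: implementation, possibly empty at first
    children: list = field(default_factory=list)

def divide(f_cur: Function, llm, depth: int = 0, max_depth: int = 6) -> Function:
    """Depth-first decomposition: write f_cur and declare its sub-goals as stubs."""
    # The model (partially) implements f_cur; unimplemented sub-goals come back
    # only as header + docstring stubs, i.e. CHILD(f_cur).
    body, stubs = llm.write_with_stubs(f_cur)          # hypothetical LLM wrapper call
    f_cur.body = body
    if depth < max_depth:                              # stop once deemed simple enough
        for stub in stubs:
            child = Function(header=stub.header, doc=stub.doc)
            f_cur.children.append(divide(child, llm, depth + 1, max_depth))
    return f_cur                                       # root of the dependency tree T
```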
Conquer is a process of achieving complex objectives by aggregating smaller functions. We
notice that child functions are not yet implemented during the top-down process of writing their parent
functions. As a result, these parent functions may not be able to effectively utilize the child functions,
or may even misuse them. FunCoder deals with this issue by re-generating functions in inverse
topological order on the dependency tree T: starting from the leaves, complex goals are handled by
compositing solved children as f∗cur ← F(f′cur, {f∗1, f∗2, . . .}), where f∗i ∈ CHILD(fcur).
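A corresponding sketch of the conquer stage is shown below, under the same hypothetical Function container and LLM wrapper as above; the consensus selection it calls is sketched in the functional-consensus discussion that follows.

```python
def conquer(f_cur: Function, llm, k: int = 11) -> Function:
    """Bottom-up re-composition in inverse topological order (post-order traversal)."""
    solved = [conquer(child, llm, k) for child in f_cur.children]   # f*_i for CHILD(f_cur)
    # Re-generate f_cur with the solved children now visible, sampling k candidates;
    # the candidate reaching functional consensus (Eq. 2, sketched later) is kept.
    candidates = [llm.recompose(f_cur, solved) for _ in range(k)]   # hypothetical call
    f_star = fun_consensus(candidates, llm.sample_inputs(f_cur))    # hypothetical input sampling
    f_star.children = solved
    return f_star
```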
Divide and conquer naturally achieves both decomposition and composition during code generation.
Unlike two-stage and agent-based methods, our approach dynamically introduces new functions
along the way, making it less burdensome than producing a complete plan at the very beginning.
Moreover, while planning or agent methods require chat capabilities, FunCoder represents sub-tasks
through functions (Figure 2), making it more applicable to specialized code generation models.
The decomposition of complex tasks benefits from solving easier sub-goals, but it might introduce
the risk of cascading errors. To mitigate this, we introduce Functional Consensus, which aims at
reducing inconsistencies in program behavior. This is achieved by sampling multiple functions and
selecting the one that exhibits consensus, as measured by the aggregated similarity of functionality
between candidates, thus abating outlier functionalities.
Functionality Similarity A program specifies its functionality (or behavior) through the control
flow defined by its code semantics. However, comparing the functionalities between two programs
based on their semantics is somewhat challenging. By decomposing the requirement into functions,
FunCoder is able to view each function's behavior as a black box that maps arguments to return
values. Considering two functions f and g with the same input domain D(f) = D(g), we define the
similarity between them, sim(f, g), as the identicalness of their outputs when given the same input values.
\mathrm{sim}(f, g) = \int_{x \in D(f)} \frac{\mathbb{1}[f(x) = g(x)]}{|D(f)|} \;\approx\; \sum_{x \in X,\; X \sim D(f)} \frac{\mathbb{1}[f(x) = g(x)]}{|X|} \qquad (1)
The similarity becomes 1 if and only if two functions output consistent values for all inputs: ∀x ∈
D(f ) : f (x) = g(x) ⇔ sim(f, g) = 1. We notice that the input domain D(f ) is unbounded in most
cases, making its measurement barely feasible in practice. Thus, we approximate it by sampling a
subset of possible inputs X ∼ D(f ) with an LLM.
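As a rough illustration, the sampled approximation in Eq. 1 can be computed as follows, treating each candidate as a Python callable and counting identical outputs over the LLM-sampled inputs. This is a sketch under that assumption, not the exact implementation.

```python
def similarity(f, g, sampled_inputs):
    """Approximate sim(f, g): fraction of sampled inputs with identical outputs."""
    matches = 0
    for args in sampled_inputs:          # X ~ D(f), inputs proposed by an LLM
        try:
            if f(*args) == g(*args):
                matches += 1
        except Exception:                # a crash counts as a mismatch in this sketch
            pass
    return matches / len(sampled_inputs)
```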
Consensus is reached by selecting the candidate f ∗ holding maximal similarity with others after
sampling multiple function implementations F = {f(i) } for the same requirements.
f^{*} = \mathrm{FunConsensus}(F) = \arg\max_{f_{(i)} \in F} \sum_{f_{(j)} \in F \setminus \{f_{(i)}\}} \mathrm{sim}(f_{(i)}, f_{(j)}) \qquad (2)
By introducing functional consensus, FunCoder produces functions that are more consistent and
common in functionality, while omitting abnormal samples. The process is applied not just to the final
program, but also to every sub-tree during the bottom-up conquering stage, resulting in step-by-step,
thorough verification from the most fundamental functions all the way up to the whole program.
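Building on the similarity sketch above, Eq. 2 then reduces to an argmax over aggregated pairwise similarity; a minimal sketch:

```python
def fun_consensus(candidates, sampled_inputs):
    """Select f* = argmax_f sum over g != f of sim(f, g), following Eq. 2."""
    def aggregated_similarity(f):
        return sum(similarity(f, g, sampled_inputs) for g in candidates if g is not f)
    return max(candidates, key=aggregated_similarity)
```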
We design FunCoder as a procedure that takes a problem in the form of a function signature f(x)
and produces a final solution f∗(x), as exemplified in Figure 1. Given a problem f(x), FunCoder
partially implements the function as f′(x), referring to unimplemented sub-functions g(y) and h(z).
These sub-functions are then fed into FunCoder to be recursively coped with. We then sample
k implementations f′(i)(x) based on the solved children g∗(y) and h∗(z). Functional consensus is
calculated by evaluating candidates on possible inputs. The function sharing maximal behavioral
similarity is combined with the solved children to formulate the final solution.
3 Experiments
We conduct experiments on competition-level code generation and mathematical reasoning bench-
marks with state-of-the-art LLMs, covered in §3.1 and §3.2, respectively. In
addition to GPT models (Ouyang et al., 2022; OpenAI, 2023), we also conduct experiments with
community models such as Llama38b (Meta AI, 2024), StableCode3b (Pinnaparaju et al., 2024), and
CodeLlama34b (Rozière et al., 2023). We use the instruct variants of these models and run inference on a
single A100-80G GPU under BF16 precision with vLLM (Kwon et al., 2023).
3.1 Code Generation
Benchmarks We choose three benchmarks for code generation evaluation: (a) HumanEval (Chen et al., 2021)
includes entry-level coding questions; (b) MBPP (Austin et al., 2021) contains questions on standard
library invocation and programming basics; and (c) xCodeEval (Khan et al., 2023) consists of
algorithmic challenges sourced from the competitive programming platform CodeForces.
Table 1: Experiment results on code generation benchmarks. We report Pass@1 as the evaluation metric.
Results from the original papers are underlined, and the best results are in bold.
Baselines We compare FunCoder with standard prompting (Brown et al., 2020), the two-stage
decomposition method Parsel (Zelikman et al., 2023), the self-testing method CodeT (Chen et al., 2023a),
the self-improvement methods Reflexion and LDB (Shinn et al., 2023; Zhong et al., 2024), and the multi-
agent development framework MetaGPT (Hong et al., 2024). We implement standard prompting with
a 1-shot demonstration. CodeT samples 11 solutions with standard prompting and evaluates them on
model-generated tests. The results for Reflexion are reproduced from the official code.
Implementation Details FunCoder uses a 2-shot prompt in the divide stage and a 1-shot prompt for
conquering sub-functions. The number of implementations sampled for functional consensus is set
to 11 for code generation tasks. For further implementation details, please refer to Appendix A.1.
3.1.2 Results
Table 1 shows the code generation performance of advanced proprietary models, GPT-3.5 (Ouyang
et al., 2022) and GPT-4 (OpenAI, 2023). On the basic programming benchmarks, HumanEval and MBPP,
FunCoder surpasses previous SOTA methods by +3.3% in Pass@1 and reduces the error rate by 18.6%.
Furthermore, FunCoder demonstrates a substantial improvement on competition-level problems,
outperforming others by 10.4% with GPT-4 and 35.3% with GPT-3.5. We observe that FunCoder can
enhance LLMs' capability of solving more complex programming tasks, with an average accuracy
improvement of 82.3% over the baseline on the Mid and Hard subsets of xCodeEval. Expert-level
problems, however, still remain a colossal challenge for even the most cutting-edge LLMs.
Evaluation is also performed on community LLMs, Llama3 (Meta AI, 2024), StableCode (Pinna-
paraju et al., 2024), and CodeLlama (Rozière et al., 2023), with results in Tables 2 and 10. FunCoder
consistently boosts the performance of smaller models in code generation, with an average improve-
ment of +38.0% compared to standard prompting, and outperforms the previous best method CodeT
by +14.6% on HumanEval. Experimental results demonstrate that our method achieves state-of-the-art
performance on various models, ranging from basic programming to competitive contests.
3.1.3 Analysis
FunCoder Democratizes to Smaller LLMs Limited by LLM capabilities, applying self-
improvement or multi-agent methods to smaller models is far from easy. By keeping decomposition
Figure 3: (a) Preliminary study on self-testing; programs are evaluated using unit tests generated
by LLMs. (b) Effectiveness of different ranking strategies. We compute Pass@k over the top-k
programs ranked by functional consensus, self-test, and random among 11 candidates (higher is better).
and composition within the code generation process, our approach exhibits better generalization.
As shown in Tables 1 and 2, with FunCoder, Llama38b and StableCode3b achieve around 1.18× the
performance of GPT-3.5 with standard prompting, and reach about 97% of GPT-4's performance on HumanEval.
Preliminary Study on the Self-Testing Method We conduct a preliminary study targeting the self-testing
method on HumanEval; results are shown in Figure 3a, with further details in Appendix A.5. We first
verify whether model-generated programs can also pass model-generated self-tests: (a) If a program
passes self-tests, most programs from GPT-3.5 would also pass the system tests, whereas as much as 19.5%/64% ≈ 30.5%
of programs from StableCode are rejected, indicating that smaller models like StableCode may not
effectively self-test and detect program errors on their own. (b) In the event of failed self-tests, a large
portion of failures are attributed to issues in the self-tests instead of the programs, for both GPT-3.5
and StableCode. These phenomena indicate that self-testing methods have limitations in generating
correct and reliable unit tests. As a result, we design functional consensus to not require any assertions,
instead performing mutual verification between solutions, as opposed to self-testing.
Effectiveness of Functional Consensus Functional consensus and self-testing can both be viewed as
ranking algorithms for selecting functions. To measure ranking effectiveness, we conduct an analysis
on HumanEval with GPT-3.5. For each problem, 11 candidates are ranked with 3 strategies: consensus,
self-test, and random shuffle (as a baseline). Effectiveness is measured via Pass@k, i.e., whether any of the
top-k ranked programs passes the system tests. Figure 3b shows that functional consensus achieves
94.7% of the upper-bound (Pass@11) performance when selecting only a single function (Pass@1), a level
that self-test only approaches at Pass@4. This clearly demonstrates that functional consensus can effectively
evaluate correctness and pick the most promising implementation on the first attempt.
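For reference, the Pass@k used in this ranking analysis can be computed per problem as below; passes_system_tests is a hypothetical oracle over the hidden tests, and the benchmark-level score averages this indicator over all problems.

```python
def pass_at_k(ranked_programs, k, passes_system_tests):
    """1 if any of the top-k ranked programs passes the hidden system tests, else 0."""
    return int(any(passes_system_tests(p) for p in ranked_programs[:k]))
```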
Table 3: Ablation study of FunCoder on HumanEval with GPT-3.5. The setting used in our main
experiment is highlighted in bold. Tokens are calculated as the sum of prompts and completions.
Ablation and Token Usage To analyze the impact of dividing, conquering, and functional consensus
in FunCoder, we carry out an ablation study with different settings; a variant that replaces consensus
with self-testing is also included. The ablation is conducted on HumanEval with GPT-3.5, as
shown in Table 3. We observe that function decomposition and re-composition deliver cumulative
performance improvements, and functional consensus is again shown to prevail over self-testing. Putting
them all together, FunCoder obtains a +17.1 improvement with 5.09× more tokens than the baseline.
Compared to the previous SOTA LDB (≈ 23K tokens), we gain +2.5 in performance with
a 76.5% reduction in token usage.
Table 4: Experimental results on MATH, a competition-level mathematical reasoning benchmark.
Best results are in bold. Text-based reasoning methods are denoted with †, while the others use program-
aided reasoning. We report both overall results and results in seven subjects: Prealgebra, Algebra,
Number Theory, Counting & Probability, Geometry, Intermediate Algebra, and Precalculus.
3.2 Mathematical Reasoning
Code can be viewed as a tool for augmenting the reasoning capabilities of LLMs (Chen et al., 2023b).
As an alternative to text-based reasoning like Chain-of-Thought (Wei et al., 2022), programs can offer
unique advantages in terms of iteration and calculation. To test the generalizability of FunCoder
beyond algorithmic challenges, we conduct an experiment on MATH (Hendrycks et al., 2021b), a
competition-level mathematical reasoning benchmark.
3.2.2 Results
The experimental results on MATH are shown in Table 4. Program-aided reasoning
generally outperforms text-based reasoning. With GPT-4 as the backbone, FunCoder outperforms
the strongest baseline, Cumulative Reasoning (Zhang et al., 2024), by (6.0 / 8.3%) in absolute / relative
terms, and surpasses the vanilla program-aided baseline PoT (Chen et al., 2023b) by (10.0 / 14.7%). When using GPT-3.5-
turbo as the backbone, FunCoder exceeds the strongest baseline by (6.2 / 11.1%) and outperforms
PoT by as much as (13.0 / 31.7%), which indicates that our approach has a strong advantage over
both text-based reasoning and other program-aided reasoning methods.
On open-source models, FunCoder with Llama3 outperforms PoT by (12.4 / 38.0%). It even
reaches competitive performance against the state-of-the-art GPT-3.5-based method (45.0 vs.
48.6). When employing StableCode and CodeLlama as the backbone, our approach achieves
significant improvements of (12.2 / 84.7%) and (9.2 / 60.5%), respectively. This improvement
demonstrates that our approach can significantly boost smaller LLMs, democratizing the complex
reasoning capabilities of open-source LLMs through programming.
3.2.3 Analysis
FunCoder Can Handle Harder Questions Figure 4 compares CoT, PoT, and FunCoder across
varying difficulty levels. It illustrates that CoT performs comparatively well on the easiest questions
but suffers a steep decline in performance as difficulty increases, suggesting that text-based reasoning
is inadequate for tackling challenging mathematical reasoning problems. The same trend is observed
for PoT. In contrast, our method consistently demonstrates high performance even on challenging
problems, particularly excelling at level-5 difficulty with nearly double the performance of PoT and
CoT. This reflects that our method, with divide-and-conquer applied, can effectively cope with
complex problems.
Figure 4: Average accuracy at each level with the chat model (GPT-3.5) and the code model
(StableCode3b) on the MATH benchmark.
Decomposed Functions are Domain-Specific We hypothesize that questions from the same subject
require similar knowledge, which should be reflected in the functionality of the sub-functions.
To verify this hypothesis, we collect statistics on the most common sub-functions produced by FunCoder in each MATH
subject, as shown in Table 5. It is apparent that different subjects require different abilities, each
with its own set of sub-functions closely associated with the domain knowledge. In addition, these
common sub-functions are fundamentally basic and straightforward. As exemplified in Appendix B.2,
our method is able to leverage and combine these basic sub-functions to achieve more complex goals,
thereby reducing the complexity of reasoning and enhancing performance.
Table 5: Top-3 most commonly used functions in each subject of MATH, listed in descending order.
Subject Functions
Prealgebra is_prime / factorial / gcd
Algebra find_roots / is_perfect_square / find_domain
Number Theory get_divisors / mod_inverse / gcd
Counting & Probability factorial / combinations / binomial_coefficient
Geometry distance / simplify_fraction / calculate_triangle_area
Intermediate Algebra find_roots / evaluate_polynomial / lagrange_interpolation
Precalculus cross_product / fraction_from_angle / dot
4 Related Work
Large Language Models for Code Code pre-training has received widespread attention, with early
models based on small language models (SLMs) (Feng et al., 2020; Lu et al., 2021; Wang et al., 2021).
In recent years, with the development of large-scale pre-training techniques, code LLMs have emerged,
showing remarkable performance in downstream code tasks (Chen et al., 2021; Nijkamp et al., 2023;
Li et al., 2022; Rozière et al., 2023; Li et al., 2023b; Guo et al., 2024). Tasks between code and
natural language (NL) can be generally divided into three major categories: NL2Code tasks such as
code generation (Austin et al., 2021; Chen et al., 2021; Hendrycks et al., 2021a; Khan et al., 2023)
and code search (Husain et al., 2019a); Code2Code tasks including code completion (Lu et al., 2021;
Zhang et al., 2023a; Liu et al., 2024), code translation (Ahmad et al., 2023; Zhu et al., 2022; Yan
et al., 2023), and test generation (Siddiq et al., 2023; Schäfer et al., 2024); Code2NL tasks like code
summarization (Husain et al., 2019b; Jin et al., 2023). This paper focuses on code generation tasks,
ranging from basic to competition level.
Code Refinement and Self-Testing Code does not always run as expected; it may contain syntax
errors, infinite loops, or bugs, so it is essential to debug and refine the code to ensure better quality.
CodeT (Chen et al., 2023a) generates unit tests to score implementations. Self-improvement
methods (Madaan et al., 2023; Shinn et al., 2023; Chen et al., 2024; Zhong et al., 2024) design
closed-loop procedures that repeatedly refine the code based on feedback. Mirroring real-life software
development processes, multi-agent frameworks (Hong et al., 2024; Qian et al., 2023) construct
specific LLM roles, such as Tester or QA, to generate tests. These studies adopt a shared paradigm wherein
self-tests are generated by LLMs. However, Olausson et al. (2024) point out that
LLMs have certain shortcomings in self-repairing their code. This paper avoids these shortcomings
by proposing functional consensus as a reliable method of evaluation.
Program-Aided Reasoning and Agents Aside from code generation tasks, programs can serve as
tools that augment LLMs in solving complex reasoning questions or interacting with external environ-
ments. Program-of-Thought (Chen et al., 2023b) and PAL (Gao et al., 2023) prompt the model to
generate a program that solves mathematical or symbolic problems. MathPrompter (Imani et al.,
2023) and Chain-of-Code (Li et al., 2023a) fuse text-based chain-of-thought with code-based
program-of-thought prompting so that the two complement each other in mathematical reasoning. Cumulative
Reasoning (Zhang et al., 2024) conducts bottom-up reasoning to derive the final answer progres-
sively. Numerous works (Sun et al., 2023; Wang et al., 2024; Yang et al., 2024) also use code as an
intermediate component to bridge LLM agents with external environments.
Decompose for Complex Problems Several recent works employ decomposition to reduce the
complexity of hard problems. Least-to-Most (Zhou et al., 2023) adopts a two-stage approach, which
first decomposes complex problems and then solves each sub-problem individually to tackle complex
reasoning tasks. Successive Prompting (Dua et al., 2022) adopts dynamic decomposition, iteratively
breaking down problems and addressing sub-problems. Tree-of-Thought (Yao et al., 2023) breaks
down complex problems into state spaces and uses tree search to solve them. Parsel (Zelikman
et al., 2023) introduces decomposition to code generation tasks, taking a three-stage approach that breaks
requirements down into a draft and intermediate Parsel programs. RepoCoder (Zhang et al., 2023b) performs
retrieval in repositories to complete unfinished code piece by piece. Unlike these methods, FunCoder
recursively decomposes problems into a tree structure, hence gradually reducing their complexity.
5 Discussion
Limitations Our approach unleashes the potential power of functions in programming, which is
advantageous on well-defined problems such as competitive programming, or program-augmented
reasoning tasks. These scenarios do not however represent all use cases, such as open-ended problems
or casual software development. Nevertheless, we believe that the idea of divide-and-conquer and
sub-modular consensus utilized by F UN C ODER can be extended to a wider range of problems, and
we consider this as a future exploration.
Broader Impact While code generation is increasingly utilized in software development, Large
Language Models (LLMs) are still prone to generating toxic, vulnerable, or malicious code. Such
programs pose risks and should be used or executed with extra caution.
6 Conclusion
In this paper, we presented F UN C ODER, a novel code generation framework that integrates the divide-
and-conquer strategy with functional consensus to address complex requirements. F UN C ODER had
demonstrated superior performance compared to state-of-the-art methods on various benchmarks
and models. Our findings highlighted the effectiveness of dynamic decomposition and functional
consensus in writing complex code, which suggests that F UN C ODER may have the potential to
empower further improvements in code generation and other fields.
References
Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang. AVATAR: A parallel
corpus for Java-python program translation. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki
(eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 2268–2281, Toronto,
Canada, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.143. URL
https://aclanthology.org/2023.findings-acl.143.
Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen
Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language
models. ArXiv preprint, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia
Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Process-
ing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020,
December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/
1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney,
Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, Arjun Guha, Michael Greenberg,
and Abhinav Jangda. Multipl-e: A scalable and polyglot approach to benchmarking neural code generation.
IEEE Trans. Software Eng., 49(7):3675–3691, 2023. doi: 10.1109/TSE.2023.3267446. URL https:
//doi.org/10.1109/TSE.2023.3267446.
Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet:
Code generation with generated tests. In The Eleventh International Conference on Learning Representations,
ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023a. URL https://openreview.net/
forum?id=ktrw68Cmu9c.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan,
Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger,
Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder,
Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet,
Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-
Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir
Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam,
Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer,
Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba.
Evaluating large language models trained on code. ArXiv preprint, abs/2107.03374, 2021. URL https:
//arxiv.org/abs/2107.03374.
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling
computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research,
2023b. ISSN 2835-8856. URL https://openreview.net/forum?id=YfZ4ZPt8zd.
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug.
In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna Austria, May 7-11,
2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=KuPixIqPiq.
Ole-Johan Dahl, Edsger W. Dijkstra, and Charles Antony Richard Hoare. Structured programming, volume 8 of
A.P.I.C. Studies in data processing. Academic Press, 1972. ISBN 978-0-12-200550-3.
Dheeru Dua, Shivanshu Gupta, Sameer Singh, and Matt Gardner. Successive prompting for decomposing
complex questions. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022
Conference on Empirical Methods in Natural Language Processing, pp. 1251–1265, Abu Dhabi, United Arab
Emirates, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.81. URL
https://aclanthology.org/2022.emnlp-main.81.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting
Liu, Daxin Jiang, and Ming Zhou. CodeBERT: A pre-trained model for programming and natural languages.
In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics:
EMNLP 2020, pp. 1536–1547, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/
2020.findings-emnlp.139. URL https://aclanthology.org/2020.findings-emnlp.139.
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig.
PAL: program-aided language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara
Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML
2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research,
pp. 10764–10799. PMLR, 2023. URL https://proceedings.mlr.press/v202/gao23f.html.
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu,
Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model
meets programming - the rise of code intelligence. ArXiv preprint, abs/2401.14196, 2024. URL https:
//arxiv.org/abs/2401.14196.
P.R. Halmos. Naive Set Theory. Undergraduate Texts in Mathematics. Springer New York, 1998. ISBN
9780387900926. URL https://books.google.com.hk/books?id=x6cZBQ9qtgoC.
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns,
Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence
with APPS. In Joaquin Vanschoren and Sai-Kit Yeung (eds.), Proceedings of the Neural Information
Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December
2021, virtual, 2021a. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/
hash/c24cd76e1ce41366a4bbe8a49b02a028-Abstract-round2.html.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song,
and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin
Vanschoren and Sai-Kit Yeung (eds.), Proceedings of the Neural Information Processing Systems
Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, vir-
tual, 2021b. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/
be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html.
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili
Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and
Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The
Twelfth International Conference on Learning Representations, ICLR 2024, Vienna Austria, May 7-11, 2024.
OpenReview.net, 2024. URL https://openreview.net/forum?id=VtmBAGCN7o.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet
challenge: Evaluating the state of semantic code search. ArXiv preprint, abs/1909.09436, 2019a. URL
https://arxiv.org/abs/1909.09436.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet
challenge: Evaluating the state of semantic code search. ArXiv preprint, abs/1909.09436, 2019b. URL
https://arxiv.org/abs/1909.09436.
Shima Imani, Liang Du, and Harsh Shrivastava. MathPrompter: Mathematical reasoning using large language
models. In Sunayana Sitaram, Beata Beigman Klebanov, and Jason D Williams (eds.), Proceedings of the
61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pp. 37–42,
Toronto, Canada, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-industry.4.
URL https://aclanthology.org/2023.acl-industry.4.
Xue Jiang, Yihong Dong, Lecheng Wang, Qiwei Shang, and Ge Li. Self-planning code generation with large
language model. ArXiv preprint, abs/2303.06689, 2023. URL https://arxiv.org/abs/2303.06689.
Xin Jin, Jonathan Larson, Weiwei Yang, and Zhiqiang Lin. Binary code summarization: Benchmarking
chatgpt/gpt-4 and other large language models. ArXiv preprint, abs/2312.09601, 2023. URL https:
//arxiv.org/abs/2312.09601.
Mohammad Abdullah Matin Khan, M. Saiful Bari, Xuan Long Do, Weishi Wang, Md. Rizwan Parvez, and
Shafiq R. Joty. xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation,
translation and retrieval. ArXiv preprint, abs/2303.03004, 2023. URL https://arxiv.org/abs/2303.
03004.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gon-
zalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with
pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei,
Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator. ArXiv
preprint, abs/2312.04474, 2023a. URL https://arxiv.org/abs/2312.04474.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc
Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas
Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas
Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin
Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy V, Jason Stillerman, Siva Sankalp Patel, Dmitry
Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Moustafa-Fahmy, Urvashi Bhattacharyya,
Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel
Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri
Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish
Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean
Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder: may the source be
with you! ArXiv preprint, abs/2305.06161, 2023b. URL https://arxiv.org/abs/2305.06161.
Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles,
James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume,
Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Mol-
loy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu,
and Oriol Vinyals. Competition-level code generation with alphacode. ArXiv preprint, abs/2203.07814, 2022.
URL https://arxiv.org/abs/2203.07814.
Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-
completion systems. In The Twelfth International Conference on Learning Representations, ICLR 2024,
Vienna Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=
pPjZIOuQuF.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang,
Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes
Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal,
Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß,
Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru
Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex
Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian J. McAuley,
Han Hu, Torsten Scholak, Sébastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados,
and et al. Starcoder 2 and the stack v2: The next generation. ArXiv preprint, abs/2402.19173, 2024. URL
https://arxiv.org/abs/2402.19173.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement,
Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tu-
fano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie
Liu. Codexglue: A machine learning benchmark dataset for code understanding and generation. In
Joaquin Vanschoren and Sai-Kit Yeung (eds.), Proceedings of the Neural Information Processing Sys-
tems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021,
virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/
c16a5320fa475530d9583c34fd356ef5-Abstract-round1.html.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha
Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine
Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-
feedback. In Thirty-seventh Conference on Neural Information Processing Systems, NeurIPS 2023, 2023.
URL https://openreview.net/forum?id=S37hOerQLB.
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming
Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In The
Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
OpenReview.net, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.
Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Is
self-repair a silver bullet for code generation? In The Twelfth International Conference on Learning
Representations, ICLR 2024, Vienna Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://
openreview.net/forum?id=y0GJXRungR.
OpenAI. GPT-4 technical report. ArXiv preprint, abs/2303.08774, 2023. URL https://arxiv.org/abs/
2303.08774.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser
Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan
Leike, and Ryan Lowe. Training language models to follow instructions with human feed-
back. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/
b1efde53be364a73914f58805a001731-Abstract-Conference.html.
Nikhil Pinnaparaju, Reshinth Adithyan, Duy Phung, Jonathan Tow, James Baicoianu, Ashish Datta, Maksym
Zhuravinskyi, Dakota Mahan, Marco Bellagente, Carlos Riquelme, and Nathan Cooper. Stable code technical
report. ArXiv preprint, abs/2404.01226, 2024. URL https://arxiv.org/abs/2404.01226.
Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong
Sun. Communicative agents for software development. ArXiv preprint, abs/2307.07924, 2023. URL
https://arxiv.org/abs/2307.07924.
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu
Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian
Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo
Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open
foundation models for code. ArXiv preprint, abs/2308.12950, 2023. URL https://arxiv.org/abs/2308.
12950.
Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language
models for automated unit test generation. IEEE Trans. Software Eng., 50(1):85–105, 2024. doi: 10.1109/
TSE.2023.3334955. URL https://doi.org/10.1109/TSE.2023.3334955.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: language
agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing
Systems, NeurIPS 2023, 2023. URL https://openreview.net/forum?id=vAElhFcKW6.
Mohammed Latif Siddiq, Joanna C. S. Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and
Vinicius Carvalho Lopes. Exploring the effectiveness of large language models in generating unit tests. ArXiv
preprint, abs/2305.00418, 2023. URL https://arxiv.org/abs/2305.00418.
Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. Adaplanner: Adaptive planning from
feedback with language models. In Thirty-seventh Conference on Neural Information Processing Systems,
NeurIPS 2023, 2023. URL https://openreview.net/forum?id=rnKgbKmelt.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya
Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao,
Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas,
Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux,
Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov,
Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan
Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang,
Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela
Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas
Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023. URL
https://arxiv.org/abs/2307.09288.
Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code
actions elicit better LLM agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
URL https://openreview.net/forum?id=8oJyuXfrPv.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery,
and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The
Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
OpenReview.net, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw.
Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware unified pre-trained
encoder-decoder models for code understanding and generation. In Marie-Francine Moens, Xuanjing
Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, pp. 8696–8708, Online and Punta Cana, Dominican Republic,
2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.685. URL https:
//aclanthology.org/2021.emnlp-main.685.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le,
and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo,
S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information
Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022,
New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_
files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac,
Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface’s transformers: State-of-the-art
natural language processing. ArXiv preprint, abs/1910.03771, 2019. URL https://arxiv.org/abs/1910.
03771.
Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. CodeTransOcean: A comprehensive
multilingual benchmark for code translation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of
the Association for Computational Linguistics: EMNLP 2023, pp. 5067–5089, Singapore, 2023. Association
for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.337. URL https://aclanthology.
org/2023.findings-emnlp.337.
Ke Yang, Jiateng Liu, John Wu, Chaoqi Yang, Yi Fung, Sha Li, Zixuan Huang, Xu Cao, Xingyao Wang, Heng Ji,
and ChengXiang Zhai. If LLM is the wizard, then code is the wand: A survey on how code empowers large
language models to serve as intelligent agents. In ICLR 2024 Workshop on Large Language Model (LLM)
Agents, 2024. URL https://openreview.net/forum?id=8dmNOD9hbq.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan.
Tree of thoughts: Deliberate problem solving with large language models. In Thirty-seventh Conference on
Neural Information Processing Systems, NeurIPS 2023, 2023. URL https://openreview.net/forum?
id=5Xc1ecxO1h.
Eric Zelikman, Qian Huang, Gabriel Poesia, Noah Goodman, and Nick Haber. Parsel: Algorithmic reasoning
with language models by composing decompositions. In Thirty-seventh Conference on Neural Information
Processing Systems, NeurIPS 2023, 2023. URL https://openreview.net/forum?id=qd9qcbVAwQ.
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu
Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Houda
Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing, pp. 2471–2484, Singapore, 2023a. Association for Computational Linguistics.
doi: 10.18653/v1/2023.emnlp-main.151. URL https://aclanthology.org/2023.emnlp-main.151.
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu
Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Houda
Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing, pp. 2471–2484, Singapore, 2023b. Association for Computational Linguistics.
doi: 10.18653/v1/2023.emnlp-main.151. URL https://aclanthology.org/2023.emnlp-main.151.
Yifan Zhang, Jingqin Yang, Yang Yuan, and Andrew Chi-Chih Yao. Cumulative reasoning with large language
models. In ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning, 2024.
URL https://openreview.net/forum?id=XAAYyRxTlQ.
Lily Zhong, Zilong Wang, and Jingbo Shang. LDB: A large language model debugger via verifying runtime
execution step-by-step. ArXiv preprint, abs/2402.16906, 2024. URL https://arxiv.org/abs/2402.
16906.
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire
Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning
in large language models. In The Eleventh International Conference on Learning Representations, ICLR
2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?
id=WZH7099tgfM.
Ming Zhu, Karthik Suresh, and Chandan K. Reddy. Multilingual code snippets training for program translation.
In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative
Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in
Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pp. 11783–11790. AAAI Press,
2022. URL https://ojs.aaai.org/index.php/AAAI/article/view/21434.
A Appendix
In the supplementary materials, we provide the details of implementation (A.1), baseline information
and settings (A.2), benchmarks (A.3), metrics (A.4), settings in the analysis (A.5), and additional
experiments (A.6). We also demonstrate the example solutions of our method and baseline in
Appendix B, and include all the prompts in Appendix C.
Alias Description
(i) Symbols
f (x) Function In the programming language, a function consists of header,
documentation, and its body {hf , df , bf }. A function can
also be viewed as a mapping f : D(f ) → Y .
hf Function Header Declares the function name, arguments, and return type, and
is used as a signature to identify the function in a program.
df Function Docstring Provides additional usage details for this function, but is
(or Documentation) optional. We encourage the model to generate docstrings to
describe sub-goals precisely.
bf Function Body The function body contains a subroutine that describes its
(or Implementation) control flow and behavior. Functions may be invoked from
within.
f ′ (x) Partially Implemented A provisional function structure generated by the LLM
where sub-procedures are not yet implemented.
f ∗ (x) Solved Function A final implementation that is no longer changed and rep-
resents F UN C ODER’s final comprehension and solution on
the original problem.
F = {f(i)} Sampled Implementations Functions that re-implement f′(x) based on solved sub-
functions, generated by models using the same input prompt.
CHILD(f(x)) Dependency Functions that are used in f(x) (excluding f(x) itself).
T Dependency Tree Defined by TREE(f, CHILD(f)), where f is the root node
of the current sub-task. Circular references are ignored.
F Function Composition To implement a certain function f respecting sub-procedures
as potentially reusable components.
(ii) Glossary
System Test Hidden Test System testing is a phase where a set of previously invisible
test cases are run against the submitted program to validate
if the code is correct and produces the expected output for
different categories of inputs.
Unit Test Assertion A unit test is an assertion consisting of given input and
expected output, whereas in Python, it takes the form of
assert func(x) == y.
Self-testing - Self-testing is an evaluation process that prompts the model
to generate unit tests (assertions) to assess the correctness
of the generated program.
Models We access the OpenAI models GPT-3.5 (gpt-3.5-turbo-0613) and GPT-4 (gpt-4-1106-
preview) through the Azure API. Weights of community models Llama3 (Meta-Llama-3-8B-Instruct),
StableCode (stable-code-instruct-3b), and CodeLlama (CodeLlama-34b-Instruct-hf) are downloaded
from HuggingFace (Wolf et al., 2019) and are served over an OpenAI-like API on a single A100-80G
GPU under BF16 precision with vLLM (Kwon et al., 2023).
Divide We instruct the model to write the current function and introduce new functions with clearly
defined sub-goals. The prompt (C.2) for the divide process includes two examples: one that needs
to introduce new functions which are left unimplemented, and another where the sub-goal is simple
enough that no further decomposition is necessary. The model generates a Python code block with
a temperature of 0.2, and the code block is extracted to represent a tree of functions, with new
functions as children of the current one. We require that any new sub-function does not refer to existing
functions, to avoid circular references. This generation process is attempted at most 3 times
until valid code with a proper signature is found in the output. FunCoder then traverses the
function tree via depth-first search and restricts the maximum depth of the tree to 6.
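A rough sketch of this retry-and-validate loop is given below; the helpers divide_prompt, extract_python_functions, and calls_any are hypothetical stand-ins for the actual prompt construction and parsing, not the released code.

```python
def divide_once(llm, f_cur, existing_headers, max_attempts: int = 3):
    """Query the model for the current function, retrying until a usable block appears."""
    for _ in range(max_attempts):
        code = llm.complete(divide_prompt(f_cur), temperature=0.2)   # hypothetical call
        funcs = extract_python_functions(code)                        # hypothetical parser
        if any(f.header == f_cur.header for f in funcs):
            # keep only new sub-functions that do not call already-existing functions,
            # avoiding circular references in the dependency tree
            children = [f for f in funcs
                        if f.header != f_cur.header
                        and not calls_any(f, existing_headers)]
            return funcs, children
    raise RuntimeError("no valid code block with a proper signature was produced")
```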
Conquer We apply the composition of sub-functions to rewrite the parent function after all sub-
functions have been fully decomposed. The code of the sub-functions is made visible to the LLM, which
is asked to rewrite the current function given a 1-shot demonstration (C.3). With functional consensus
applied, the model samples multiple implementations at a temperature of 0.8, and the one that
reaches consensus is kept for further bottom-up processing.
Functional Consensus Functional consensus is applied in the conquer stage. Formally, Consen-
sus@k samples k - 1 implementations in the conquer stage and reuses the one produced in the divide
stage, resulting in a set F of k candidate programs. We then prompt the model with a 1-shot
demonstration (C.4) to generate potential inputs X for the given function, which are fed to each candidate
for execution. As described in Eq. 2, whenever two functions produce the same output on a given input,
each gains 1 point of similarity. A thrown exception or a timeout during execution assigns -100
points to the candidate, as it indicates potentially problematic code. Similar to self-testing methods,
we also leverage the example input/output at the root node to filter out candidates with wrong
functionality. Finally, the candidate with the maximum score over all inputs is selected, as it reaches
consensus with the other implementations.
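A simplified sketch of this scoring is given below; the sandboxed executor run(code, args) is hypothetical, and the real implementation additionally filters candidates using the example input/output at the root node.
```python
CRASH = object()  # sentinel for an exception or timeout during execution

def functional_consensus(candidates, inputs):
    """Select the candidate whose outputs agree most with the other
    candidates over the sampled inputs."""
    outputs = []
    for code in candidates:
        outs = []
        for args in inputs:
            try:
                outs.append(run(code, args))  # hypothetical sandboxed executor
            except Exception:
                outs.append(CRASH)
        outputs.append(outs)

    def score(i):
        total = 0
        for t, out in enumerate(outputs[i]):
            if out is CRASH:
                total -= 100  # penalize potentially problematic code
            else:
                # +1 for every other candidate producing the same output on this input
                total += sum(1 for j in range(len(candidates))
                             if j != i and outputs[j][t] == out)
        return total

    best = max(range(len(candidates)), key=score)
    return candidates[best]
```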
Standard Prompting performs a one-time generation and directly outputs the entire code or the final
result. In code generation tasks, we use 1-shot prompting with a temperature of 0.3. For MATH,
we sample one question-answer pair per subject from the train set, resulting in a 7-shot prompt, and run
self-consistency (Wang et al., 2023) with consistency@5 and a temperature of 0.7.
CodeT (Chen et al., 2023a) samples multiple code solutions X and unit tests Y, where a unit test is
an assertion consisting of a given input and its expected output (in Python, it takes the form of
"assert func(x) == y"). CodeT then checks the programs against the self-tests and partitions them
into sets; the score of a set is defined as the number of programs it contains multiplied by the number
of tests they pass. Finally, CodeT selects a program from the set with the most agreement (the highest-scoring set).
Similar to the setting of FunCoder, we sample 11 candidate solutions at a temperature of 0.8.
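For reference, a rough sketch of this dual-agreement selection follows; the test runner passes(p, t) is hypothetical, and CodeT's official implementation differs in details such as how tests are sampled.
```python
from collections import defaultdict

def codet_select(programs, tests):
    """CodeT-style dual agreement: group programs by the exact set of
    self-tests they pass, score each group by
    (#programs in the group) * (#tests the group passes), and return a
    program from the best group."""
    groups = defaultdict(list)
    for p in programs:
        passed = frozenset(t for t in tests if passes(p, t))  # hypothetical runner
        groups[passed].append(p)
    best_tests, best_group = max(groups.items(),
                                 key=lambda kv: len(kv[1]) * len(kv[0]))
    return best_group[0]
```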
Parsel (Zelikman et al., 2023) consists of three generation stages: high-level sketch, Parsel program,
and final program. The Parsel program is an intermediate representation of code that describes and
organizes program structure. We report the result of HumanEval with GPT-4 from the original paper.
Reflexion (Shinn et al., 2023) is a closed-loop agent system that generates unit tests and iteratively
refines the program based on self-test feedback. The results for GPT-4 on HumanEval and MBPP
are taken from the original paper. Based on the officially released code3, we evaluate GPT-3.5 and the
community models under the Reflexion strategy with max_iters=2 and Pass@1. For the xCodeEval
benchmark, which is judged through standard input/output, we wrap the standard input into a function
argument and take the return value as the output, in the form of "def main(input_str: str)
-> str"; the sample inputs/outputs are likewise transformed into visible tests for the reflexion process.
3 GitHub: noahshinn/reflexion
MetaGPT (Hong et al., 2024) employs a multi-agent strategy that assigns roles and encodes human-
like software development procedures. The scripts for reproducing its results had not been made public
by the time this paper was completed. Therefore, we include the original result for GPT-4 on the HumanEval
dataset under the with feedback setting.
LDB (Zhong et al., 2024) segments programs into basic blocks and tracks the values of intermediate
variables after each block throughout runtime execution, allowing large language models to verify
the correctness of smaller code units. We adopt the results as reported in the original paper.
Chain-of-Thought Prompting (Wei et al., 2022) generates step-by-step reasoning leading to the
final answer. The solution is formatted in LaTeX and uses \boxed{} to mark the final answer. We
sample one shot per subject from the MATH train set, resulting in a 7-shot demonstration, and run
with consistency@5 and a temperature of 0.7.
Program-of-Thought (Chen et al., 2023b) utilizes the coding ability of LLMs to generate programs
rather than text-based solutions for reasoning tasks. On MATH, we prompt the model with a 1-shot
demonstration to generate a solution() function that returns the final answer to the problem. The
program is then executed in a Python environment and its return value is taken as the answer. If an
exception is thrown during execution, the model regenerates a new program until it succeeds or
reaches 3 attempts. Similar to CoT, Program-of-Thought samples 5 programs at a temperature of 0.7 and
votes on the final result.
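A minimal sketch of this execute-and-retry loop is shown below; llm_generate is a hypothetical single-completion helper, and in practice five such answers are sampled and majority-voted.
```python
def run_pot(question: str, max_attempts: int = 3):
    """Execute a model-written solution() function and retry on failure."""
    for _ in range(max_attempts):
        code = llm_generate(question, temperature=0.7)  # hypothetical LLM call
        env = {}
        try:
            exec(code, env)            # defines solution()
            return env["solution"]()   # return value is the predicted answer
        except Exception:
            continue                   # regenerate a new program on any error
    return None
```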
Self-Refine (Madaan et al., 2023) iteratively prompts the model to give feedback on the generated code
and refine it accordingly. Self-Refine does not incorporate self-tests; refinement is conducted
solely on model feedback. In our preliminary study on HumanEval, this feedback is weak and does not
improve performance. In MATH, however, the solution program can be executed without the need
for generated assertions. We thus extend Self-Refine to capture the runtime error trace as feedback
and refine the code until it runs successfully or 3 retries are exceeded.
Cumulative Reasoning (Zhang et al., 2024) starts by decomposing the input problem into propo-
sitions and conducts bottom-up reasoning until the final answer can be concluded. The results for
Cumulative Reasoning are taken from the original paper under the with-code setting.
Table 7: Overview and details of the HumanEval, MBPP, xCodeEval, and MATH datasets.
MBPP (Austin et al., 2021) consists of fundamental Python programming problems, with a total of
974 examples covering Python programming basics, standard library usage, and related assessment.
Following Shinn et al. (2023), we adopt the mbpp-typed split from MultiPL-E (Cassano et al., 2023)
and sample 200 instances, using Pass@1 as the metric. The original prompt4 from MBPP includes all
hidden tests in the input problem, which may cause label leakage when using these tests to refine or
select programs. To ensure a fair comparison, MultiPL-E removes the test information in the prompt.
xCodeEval (Khan et al., 2023) is a competition-level multilingual and multitask benchmark covering
17 programming languages. It collects 25 million openly available samples from codeforces.com, a
platform for competitive programming. The data we use comprise problem descriptions from
problem_descriptions.jsonl and system tests from unittest_db.json, covering 7,635 competition
problems with an average of 51.1 tests per problem. Note that since the tests in xCodeEval are crawled,
some are incomplete due to the website's content limit (they end with an ellipsis and the remaining
content is missing); we filter out problems with invalid test cases. Based on the CodeForces rating
(EbTech, 2024), we categorize the problems by difficulty: Easy (≤ 1200), Mid (1200-1599), Hard
(1600-1999), and Expert (≥ 2000).
We sample 500 problems from the full split using the basic filtering rule above, resulting in Table 8.
CodeForces problems follow a different input/output style from HumanEval and MBPP: programs read
from standard input and print the answer to standard output. We therefore judge each program on the
system tests with a CodeForces-style judger and use Pass@1 (the program must pass all system tests)
as the evaluation metric.
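A simplified sketch of such a judger is given below; the actual judger may compare outputs token by token or handle special cases differently.
```python
import subprocess

TIME_LIMIT = 2.5  # seconds, see A.4

def judge(program_path: str, tests) -> bool:
    """Pass@1-style verdict: the submission must produce the expected output
    on every system test within the time limit."""
    for case in tests:
        try:
            proc = subprocess.run(
                ["python", program_path], input=case["input"],
                capture_output=True, text=True, timeout=TIME_LIMIT)
        except subprocess.TimeoutExpired:
            return False                                  # time limit exceeded
        if proc.returncode != 0:
            return False                                  # runtime error
        if proc.stdout.strip() != case["output"].strip():
            return False                                  # wrong answer
    return True
```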
Table 8: Number of test samples in (a) xCodeEval difficulty, (b) MATH level, (c) MATH subject.
A.4 Metrics
Pass@k A program is considered passed (or accepted) when it passes all system tests without errors
and within the time limit; in our experiments, we set the time limit to 2.5 seconds. Pass@k judges k
independent programs, and the result is 1 if any of them passes. In most of our experiments, we use
Pass@1 as the metric, since it reflects the accuracy a method achieves without human feedback.
Pass@k with k > 1, on the other hand, is equivalent to filtering programs through hidden,
human-annotated test labels.
EM-GPT The ground-truth labels in MATH are written in LaTeX, so the accuracy of model predictions
cannot be computed directly via exact match (EM). MATH provides a judge program5 that normalizes
LaTeX syntax and checks whether the two resulting strings are equal. However, this is insufficient for
comparing LaTeX-formatted labels against the varied outputs of programs. We follow the evaluation
criteria of previous work (Zhang et al., 2024) and use GPT-4 to assess the consistency between
predictions and ground truths, with the prompt shown in C.6.
4 Original MBPP prompt: https://github.com/google-research/google-research/tree/master/mbpp
5 math_equivalence: https://github.com/hendrycks/math/blob/main/modeling/math_equivalence.py
A.5 Details of Analysis
Details of Preliminary Analysis on Self-testing (Figure 3.a) The preliminary study is conducted
on the HumanEval dataset, which includes system tests S to evaluate the accuracy of the program, as
well as one human-annotated canonical solution c. For each question, we (1) obtain one solution
program p from Standard Prompting, and (2) prompt the model to generate 7 self-tests T based on the
question and the entry function, each in the form of a unit test assert f(x) == y. We then judge both
the generated program p and the canonical solution c against the self-tests T and the system tests S.
Formally, a pair (x, Y) denotes that program x passes test set Y: (p, S) indicates that the generated
program passes the system tests, demonstrating its correctness, while ¬(c, T) means that the canonical
solution fails the self-tests, suggesting that the model-generated tests may be wrong. The self-test
results on generated programs are first divided into two classes: self-test passed or failed. If the
self-tests pass, self-improvement methods stop iterating and pick this program as the final result; the
next step is then to determine whether the program passes the system tests. If the self-tests fail, the
error may lie in either the program or the tests themselves. In this case, the correctness of the program
is checked with the system tests, (p, S), and the correctness of the unit tests with the canonical
program, (c, T). The results on GPT-3.5 and StableCode are shown in Figure 3, and detailed
explanations of these conditions can be found in Table 9.
Details of Ranking Strategy Comparison (Figure 3.b) We obtain 11 candidate programs from
FunCoder on HumanEval with GPT-3.5 and rank them with three strategies, ensuring that the same
candidate set is used for a fair comparison. An effective ranking strategy should place correct
programs at the top and filter out erroneous ones. We therefore measure effectiveness by computing
Pass@k on the top-k programs selected by each strategy. The Pass@11 result serves as an upper
bound, as it uses all programs to compute the pass rate.
How We Count Frequently Used Functions in MATH (Table 5) In the mathematical reasoning
experiments, we used a subset of 500 items from the MATH test set, an average of 71.4 questions per
subject. However, 71.4 programs per subject are too few to reliably identify common functions.
We therefore sample 3,000 problems from the MATH test set for this experiment and run the divide-
only setting of FunCoder on them. The occurrences of sub-functions are then counted by name after
extracting the function nodes from the code trees, grouped by category.
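A sketch of this counting procedure follows, assuming the generated code of each problem is available as plain Python source; excluding the root solution function is an illustrative choice here, not a detail confirmed by the paper.
```python
import ast
from collections import Counter

def count_subfunctions(programs):
    """Count how often each sub-function name occurs across the generated
    programs of one subject (the root `solution` function is excluded)."""
    counter = Counter()
    for code in programs:
        try:
            tree = ast.parse(code)
        except SyntaxError:
            continue
        counter.update(node.name for node in ast.walk(tree)
                       if isinstance(node, ast.FunctionDef)
                       and node.name != "solution")
    return counter
```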
A.6 Supplementary Results
Open-source models on MBPP and xCodeEval We provide results for the community models
Llama3, StableCode, and CodeLlama on the remaining code generation benchmarks in Table 10. Our method
improves over the Standard baseline by an average relative margin of 12% on MBPP and 197% on
xCodeEval. Note that these smaller models have low pass rates on competition problems, which introduces
relatively high variance, so we report the median over 3 runs.
Table 10: Results for MBPP and xCodeEval with community models.
Model           Method      MBPP              xCodeEval
                            Pass@1   ∆↑       Easy    Mid    Hard   Expert   All
Llama3-8b       Standard    60.5     -        9.0     1.8    0.0    0.0      3.6
                FunCoder    62.5     +2.0     22.0    0.9    0.0    0.0      8.0
StableCode-3b   Standard    51.5     -        7.3     0.9    0.0    0.0      2.8
                FunCoder    63.5     +12.0    13.5    4.5    1.1    0.0      6.2
CodeLlama-34b   Standard    53.5     -        2.3     0.0    0.0    0.0      0.8
                FunCoder    58.5     +5.0     10.2    0.0    0.0    0.0      3.6
Token Usage of Other Methods We provide token usage results for FunCoder and the baseline
methods on the HumanEval dataset with GPT-3.5 in Table 11, reporting the average token usage per
problem. Token usage is computed as the sum of the prompt tokens and completion tokens returned
by the OpenAI chat completion API6. For LDB, we report the token usage given in the original
paper (Zhong et al., 2024).
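For clarity, the small sketch below shows how the usage of a single call can be read from the API response; the usage fields are standard in the OpenAI Python SDK, while the model name is shown only for illustration (our experiments use the Azure deployments described above).
```python
from openai import OpenAI

client = OpenAI()

def tokens_of(messages) -> int:
    """Tokens consumed by one chat-completion call; summing this over every
    call made for a problem yields the per-problem usage reported in Table 11."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages)
    return response.usage.prompt_tokens + response.usage.completion_tokens
```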
Table 11: Token usage for different settings of FunCoder and baseline methods. The LDB results
are taken from the original paper. The main settings for LDB and FunCoder are bolded.
Full Results for MATH Levels The MATH dataset divides the problems into five levels of difficulty.
The difficulty distribution of our test set can be found in Table 8. We report the average accuracy
of F UN C ODER and other methods at each level in Table 12. The results of Cumulative Reasoning
are obtained from the original paper (Zhang et al., 2024). Experiment results demonstrate that our
method consistently enhances the model’s reasoning ability across all levels of MATH.
6 https://platform.openai.com/docs/guides/text-generation/managing-tokens
Table 12: Full results of each method at different levels of MATH. The best results are in bold.
Text-based reasoning methods are denoted with † , while others use program-aided reasoning.
B Examples
We provide example solutions from the baselines and FunCoder on code generation and mathematical
reasoning. All results are generated with GPT-3.5.
## Description
You have two friends. You want to present each of them several positive integers. You want to
present cnt1 numbers to the first friend and cnt2 numbers to the second friend. Moreover, you want
all presented numbers to be distinct, that also means that no number should be presented to both
friends. In addition, the first friend does not like the numbers that are divisible without remainder
by prime number x. The second one does not like the numbers that are divisible without remainder
by prime number y. Of course, you're not going to present your friends numbers they don't like.
Your task is to find such minimum number v, that you can form presents using numbers from a set
1, 2, ..., v. Of course you may choose not to present some numbers at all. A positive integer number
greater than 1 is called prime if it has no positive divisors other than 1 and itself.
## Input
The only line contains four positive integers cnt1, cnt2, x, y (1 <= cnt1, cnt2 < 10^9;
cnt1 + cnt2 <= 10^9; 2 <= x < y <= 3*10^4) - the numbers that are described in the statement.
It is guaranteed that numbers x, y are prime.
## Output
## Sample Input 1
```
3 1 2 3
```
## Sample Output 1
```
5
```
## Sample Input 2
```
1 3 2 3
```
## Sample Output 2
```
4
```
## Notes
In the first sample you give the set of numbers {1, 3, 5} to the first friend and the set of numbers {2}
to the second friend. Note that if you give set {1, 3, 5} to the first friend, then we cannot give any of
the numbers 1, 3, 5 to the second friend. In the second sample you give the set of numbers {3} to the
first friend, and the set of numbers {1, 2, 4} to the second friend. Thus, the answer to the problem is 4.
B.1.1 Standard
def main() -> None:
    cnt1, cnt2, x, y = list(map(int, input().split()))
    # binary search
    l, r = 1, 2 * (cnt1 + cnt2)  # upper bound
    while l < r:
        mid = (l + r) // 2
        # number of numbers that can be given to the first friend
        cnt_x = mid // x
        # number of numbers that can be given to the second friend
        cnt_y = mid // y
        # number of numbers that can be given to both friends
        cnt_xy = mid // (x * y)
        # number of numbers that can't be given to any friend
        cnt_none = mid - cnt_x - cnt_y + cnt_xy
        if cnt_x >= cnt1 and cnt_y >= cnt2 and cnt_none >= cnt1 + cnt2:
            r = mid
        else:
            l = mid + 1
    print(l)
Result: Wrong answer on sample test 1: expected: ’5’, found: ’8’ [1th token] ✗
B.1.2 FunCoder
def main() -> None:
    cnt1, cnt2, x, y = map(int, input().split())
    left = 1
    right = int(2e9)
    while left < right:
        mid = (left + right) // 2
        if is_valid(mid, x, y, cnt1, cnt2):
            right = mid
        else:
            left = mid + 1
    print(left)
# check if there are cnt1 + cnt2 numbers divisible by both x and y
lcm = (x * y) // gcd(x, y)
common_numbers = v - (v // lcm)
if common_numbers < cnt1 + cnt2:
    return False
return True
while b:
    a, b = b, a % b
return a
Result: Accepted ✓
B.2.1 CoT
We can solve for $a$ by multiplying both sides of the congruence by the modular inverse of $27$
modulo $40$. Since $27$ and $40$ are relatively prime, we know that $27$ has a modular inverse
modulo $40$. We can find this inverse using the Euclidean algorithm:
\begin{align*}
\text{gcd}(40, 27) &= \text{gcd}(27, 13) \\
&= \text{gcd}(13, 1) \\
&= 1.
\end{align*}
Therefore, the modular inverse of $27$ modulo $40$ exists. We can find it using the Euclidean
algorithm backwards:
\begin{align*}
1 &= 13 - 3 \cdot 4 \\
&= 13 - 3(40 - 27) \\
&= 4 \cdot 27 - 3 \cdot 40.
\end{align*}
Thus, the modular inverse of $27$ modulo $40$ is $4$. Multiplying both sides of the congruence
$27a \equiv 17 \pmod{40}$ by $4$ gives $108a \equiv 68 \pmod{40}$. Simplifying the left-hand side
gives $8a \equiv 28 \pmod{40}$. Dividing both sides by $4$ gives $2a \equiv 7 \pmod{10}$. The
smallest positive integer $a$ satisfying this congruence is $a = 9$, and the second-smallest is
$a = 19$. Therefore, the sum of the smallest and second-smallest positive integers $a$ satisfying the
congruence $27a \equiv 17 \pmod{40}$ is $9 + 19 = \boxed{28}$.
Result: 28 ✗
B.2.2 PoT
def solution():
    r"""What is the sum of the smallest and second-smallest positive
    integers $a$ satisfying the congruence $$27a \equiv 17 \pmod{40}~?$$"""
    k = 1
    candidates = []
    while len(candidates) < 2:
        a = (40 * k + 17) // 81
        if a > 0 and (40 * k + 17) % 81 == 0:
            candidates.append(a)
        k += 1
B.2.3 FunCoder
from typing import Tuple

def solution():
    r"""What is the sum of the smallest and second-smallest positive
    integers $a$ satisfying the congruence $$27a \equiv 17 \pmod{40}~?$$"""
    a = 27
    m = 40
    b = 17
    inverse = find_inverse(a, m)
    x = (b * inverse) % m
    _, x, _ = extended_euclid(a, m)
    return x % m if x >= 0 else (x + m) % m
C Prompts
C.1 Prompt for Standard
You are a programming copilot, you can solve a problem by writing Python functions. Your task is to:
Let's think step by step and complete the following Python function `sum_factor` that solves:
``` python
def sum_factor ( a : int , b : int ) -> int :
""" Return the sum of all common prime factors of $a$ and $b$ """
raise NotImplementedError ()
```
``` python
def sum_factor ( a : int , b : int ) -> int :
""" Return the sum of all common prime factors of $a$ and $b$ """
factors_a = prime_factor ( a )
factors_b = prime_factor ( b )
common_factors = get_common ( factors_a , factors_b )
return sum ( common_factors )
```
Let ' s think step by step and complete the following Python function `{
cur_func_name } ` that solves :
``` python
{ cur_func }
```
C.2 Prompt for Divide
- For every turn, you need to write a Python function that returns the answer based on Current Code (not code in chat history).
- Do not modify function name, arg names, docstring in given functions.
- You can import libraries to better solve the problem.
- If a single function is too hard to solve, you can decompose it into multiple smaller functions.
- You can leave new functions unimplemented for now, but write the function at the end of the code and comment what the function does.
Current Code :
``` python
def sum_common_factors ( a : int , b : int ) -> int :
""" Compute the sum of all common prime factors of $a$ and $b$ """
raise NotImplementedError ()
```
Let ' s think step by step and complete the following Python function `
sum_common_factors ` that solves :
" Compute the sum of all common prime factors of $a$ and $b$ "
``` python
def sum_common_factors ( a : int , b : int ) -> int :
""" Compute the sum of all common prime factors of $a$ and $b$ """
factors_a = prime_factor ( a )
factors_b = prime_factor ( b )
common_factors = get_common ( factors_a , factors_b )
return sum ( common_factors )
```
<User>:
Current Code :
``` python
def sum_common_factors ( a : int , b : int ) -> int :
""" Compute the sum of all common prime factors of $a$ and $b$ """
factors_a = prime_factor ( a )
factors_b = prime_factor ( b )
common_factors = get_common ( factors_a , factors_b )
return sum ( common_factors )
Let ' s think step by step and complete the following Python function `
get_common ` that solves :
" get common element in two list $a$ and $b$ "
``` python
def get_common ( a : list , b : list ) -> list :
""" get common element in two list $a$ and $b$ """
ret = []
for item in a :
if item in b :
ret . append ( item )
return ret
```
Current Code :
``` python
{ prev_code }
```
Let ' s think step by step and complete the following Python function `{
cur_func_name } ` that solves :
"{ cur_func_doc }"
C.3 Prompt for Conquer
- For every turn, you need to write a Python function that returns the answer, based on current code (not code in chat history) and problem description.
- Do not modify function name, arg names, docstring in given functions.
- Consider reusing existing functions that are already implemented.
- You can import libraries to better solve the problem.
Current Code :
``` python
def prime_factor ( x : int ) -> list :
""" get a list of prime factors of number $x$ """
ret = []
i = 1
while i * i <= x :
i += 1
if x % i == 0 and is_prime ( i ) :
ret . append ( i )
return ret
raise NotImplementedError ()
```
Let ' s think step by step and implement the following method `
sum_common_factors ` using existing functions to solve :
" Return the sum of all common prime factors of $a$ and $b$ "
``` python
def sum_common_factors ( a : int , b : int ) -> int :
""" Compute the sum of all common prime factors of $a$ and $b$ """
factors_a = prime_factor ( a )
factors_b = prime_factor ( b )
common_factors = get_common ( factors_a , factors_b )
return sum ( common_factors )
```
Current Code :
``` python
{ prev_code }
```
Let ' s think step by step and implement the following method `{
cur_func_name } ` using existing functions to solve :
"{ cur_func_doc }"
C.4 Prompt for Generate Possible Input
Let ' s think step by step and create some tests for the following
function ` check_valid_brackets (...) ` in Python .
``` python
def check_valid_brackets(seq: str) -> bool:
    """Determine if a bracket sequence consisting of '(', ')', '{', '}',
    '[' and ']' is valid."""
    mapping = {')': '(', '}': '{', ']': '['}
    stack = []
    for c in seq:
        if c in mapping:
            if not stack or stack[-1] != mapping[c]:
                return False
            stack.pop()
        else:
            stack.append(c)
    return not stack
```
``` python
check_valid_brackets("()")        # True
check_valid_brackets("(([[]]))")  # True
check_valid_brackets("((())")     # False
check_valid_brackets("()[]{}")    # True
check_valid_brackets("([)]")      # False
check_valid_brackets("")          # True
check_valid_brackets(")(")        # False
```
Let ' s think step by step and create some tests for the following
function `{ cur_func_name }(...) ` in Python .
``` python
{ prev_code }
```
- You should invoke the function and assert its results in a one-liner fashion.
- Do not bring in imports other than what's already imported. Use the pre-declared imports in the original function only.
- The callee may have multiple arguments, treat them with care.
- You **must** respect the function signature and docstring, and be aware so you don't generate illegal inputs.
- Keep the inputs & outputs simple but general, and that either edge cases or common cases are meaningful.
Let ' s think step by step and create some tests for the following
function ` lcm (...) ` in Python .
```python
def lcm(a: int, b: int) -> int:
    """Find the least common multiple of `a` and `b`. Samples:
    >>> lcm(3, 5)
    15
    >>> lcm(4, 6)
    12
    """
```
Store your test cases for `lcm(...)` as assertions, one per line. They will be called later.
```python
assert lcm(15, 25) == 75
assert lcm(32, 32) == 32
assert lcm(1, 5) == 5
assert lcm(1, 1) == 1
assert lcm(17, 19) == 17 * 19
```
<User>:
``` python
{ prev_code }
```