
CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs


Weijie Lv
Nanjing University of Aeronautics and Astronautics
Nanjing, China
lvweijie@nuaa.edu.cn

Xuan Xia
Shenzhen Institute of Artificial Intelligence and Robotics for Society
Shenzhen, China
xiaxuan@cuhk.edu.cn

Sheng-Jun Huang
Nanjing University of Aeronautics and Astronautics
Nanjing, China
huangsj@nuaa.edu.cn

arXiv:2408.02193v1 [cs.CL] 5 Aug 2024

Abstract—Large language models (LLMs) have shown great potential in code-related tasks, yet open-source models lag behind their closed-source counterparts. To bridge this performance gap, existing methods generate vast amounts of synthetic data for fine-tuning, leading to inefficiencies in training. Motivated by the need for more effective and efficient training, we propose the Code Adaptive Compute-efficient Tuning (CodeACT) framework. CodeACT introduces the Complexity and Diversity Aware Sampling (CDAS) method to select high-quality training data based on complexity and diversity, and the Dynamic Pack padding strategy to reduce computational resource usage by minimizing padding tokens during training. Experimental results demonstrate that CodeACT-DeepSeek-Coder-6.7B, fine-tuned on only 40% of the EVOL-Instruct data, achieves an 8.6% performance increase on HumanEval, reduces training time by 78%, and decreases peak GPU memory usage by 27%. These findings underscore CodeACT's ability to enhance the performance and efficiency of open-source models. By optimizing both the data selection and training processes, CodeACT offers a comprehensive approach to improving the capabilities of open-source LLMs while significantly reducing computational requirements, addressing the dual challenges of data quality and training efficiency, and paving the way for more resource-efficient and performant models. The code is available at Kyle-Lyu/CodeACT.

Index Terms—AI4SE, Large Language Models, Code Generation, Data Selection, Compute-efficient Tuning

I. INTRODUCTION

Large language models (LLMs) have recently achieved remarkable success across various domains, with their applications in code-related tasks emerging as a focal point in software engineering research. Pioneered by models like Codex [1], LLMs have demonstrated exceptional prowess in code processing, leading to the development of commercial products such as GitHub Copilot¹ and open-source alternatives like CodeLlama [2], DeepSeek-Coder [3], and StarCoder [4]. However, a persistent performance gap remains between open-source and closed-source models, particularly in code generation tasks.

¹ https://github.com/features/copilot

Instruction fine-tuning is a method employed to refine the performance of LLMs, predominantly relying on amassing large datasets. Approaches such as CodeAlpaca [5], EVOL-Instruct [6], and OSS-Instruct [7] have leveraged more powerful LLMs (e.g., GPT-4 [8]) to generate synthetic coding instructions and fine-tune open-source models. While these methods have shown promise, they often lead to inefficient optimization and training processes due to the presence of low-quality synthetic data within the massive training corpora.

Recent research, exemplified by LIMA [9], suggests that data quality is more crucial than quantity, demonstrating superior performance with just 1,000 carefully curated samples. This insight raises a critical question: How can we identify the most influential training samples to enhance model performance and training efficiency simultaneously?

Complex programming problems typically require an integration of various knowledge domains and skills, demanding more intricate reasoning processes than simpler ones. Intuitively, these complex problems could contribute substantially to model training. Additionally, numerous studies [10]–[13] have highlighted the importance of data diversity in improving model performance. These observations suggest that selecting diverse and complex code data could be key to efficient and effective model training.

Building upon these insights, we propose a novel framework called Code Adaptive Compute-efficient Tuning (CodeACT). This framework addresses both the quality of training data and the efficiency of the fine-tuning process, two interrelated aspects that collectively impact the performance and resource utilization of Code LLMs. At the core of CodeACT is the Complexity and Diversity Aware Sampling (CDAS) method, specifically designed to identify the most influential code data. Notably, CDAS operates adaptively by utilizing the base LLM for data selection, thereby eliminating the need for external LLMs. By selecting a smaller set of high-quality data, CDAS aims to enhance training efficiency while maintaining or improving model performance.

Complementing CDAS, we introduce the Dynamic Pack padding strategy to optimize resource utilization. Traditional padding strategies often introduce a large number of padding tokens, leading to inefficient resource utilization and prolonged training times. Dynamic Pack addresses this by sorting data within a batch by length and merging multiple instances, effectively reducing the rate of padding tokens. This technique not only accelerates the training process but also decreases computational resource consumption, further enhancing the efficiency gains achieved through CDAS's data selection.
The synergy between CDAS and Dynamic Pack in the CodeACT framework enables significant improvements in both model performance and training efficiency. CDAS contributes to efficiency by selecting a reduced set of influential data, while Dynamic Pack further boosts efficiency by minimizing padding tokens during training. Our experiments demonstrate that CodeACT-DeepSeek-Coder-6.7B, fine-tuned on only 40% of the EVOL-Instruct data, achieves an 8.6% increase on HumanEval (from 58.5% to 67.1%) compared to training with the full dataset. Moreover, it reduces training time by 78% (from 297 to 68 minutes) and decreases peak GPU memory usage by 27% (from 50.75 GB to 37.23 GB).

The main contributions of this paper are as follows:
• We propose the CodeACT framework, which integrates data selection and an efficient padding strategy to enhance both the performance and training efficiency of LLMs.
• We introduce the CDAS method, an adaptive sampling method specifically designed for code data, which identifies influential training samples by considering both complexity and diversity.
• We develop the Dynamic Pack strategy, which significantly reduces padding tokens during the training phase, thereby further improving training efficiency and resource utilization.
• Extensive experimental results validate the effectiveness of our framework, demonstrating superior performance with less data and substantially enhanced training efficiency.

II. PRELIMINARIES

We denote a dataset as D, which consists of n triplets x = (Instruction, [Input], Response) representing instruction tuning data samples. Earlier instruction tuning samples typically feature separate instruction and input segments for better control [14]–[16], while most current datasets integrate the inputs with instructions [9], [17]–[19]. For simplicity, let q = map(Instruction, [Input]) denote the complete instruction and a the corresponding response. The mapping function may simply concatenate them with control tokens. Thus, D = {(q_1, a_1), (q_2, a_2), ..., (q_n, a_n)} represents a collection of n instruction-response pairs.

A. Perplexity

In the context of instruction tuning, the objective is to maximize the likelihood of generating the correct response given the corresponding instruction. Therefore, perplexity can serve as a potential metric for assessing the difficulty of samples. Specifically, the perplexity of a given sample (q_i, a_i) is defined as:

\mathrm{PPL}(a_i \mid q_i) = \exp\Big( -\frac{1}{N} \sum_{j=1}^{N} \log P(a_{i,j} \mid q_i, a_{i,1}, \dots, a_{i,j-1}) \Big),   (1)

where N is the length of the response a_i, and a_{i,j} represents the j-th token in the response a_i.

B. Instruction-Following Difficulty Score

Cherry LLM [20] introduces a self-guided approach for evaluating data complexity that does not rely on powerful external models (e.g., GPT-4). This method leverages the Instruction-Following Difficulty (IFD) score, which is calculated using an originally pre-trained LLM. The IFD score is a purely statistical measure that compares the losses or perplexities when the model generates a response a with and without the instructional context q. This comparison quantifies the extent to which the instruction aids in generating the corresponding response. This approach is particularly advantageous as it allows the model to autonomously evaluate the complexity of the data, thereby making the process more efficient and scalable.

A higher IFD score indicates that, even with the given instruction q, the model struggles to generate an accurate response a, reflecting the complexity of the sample. Conversely, a lower IFD score suggests that the instruction q significantly facilitates the generation of the correct response a without needing further training, which may be due to the sample being straightforward or the instruction containing detailed information that aligns closely with the model's pre-trained knowledge. The IFD score for a given instruction-response data pair is calculated as follows:

\mathrm{IFD}(a_i \mid q_i) = \frac{\mathrm{PPL}(a_i \mid q_i)}{\mathrm{PPL}(a_i)},   (2)

where PPL(a_i | q_i) and PPL(a_i) denote the perplexities of the model in fitting response a_i with and without the instruction q_i, respectively.
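To make these two quantities concrete, the following sketch computes PPL(a|q), PPL(a), and the resulting IFD score for one instruction-response pair with a Hugging Face causal language model. It is a minimal illustration rather than the released CodeACT implementation; the prompt formatting, special-token handling, and the commented-out example model name are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_perplexity(model, tokenizer, prompt: str, response: str) -> float:
    """Perplexity of the response tokens, optionally conditioned on a prompt (Eq. 1)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids if prompt else None
    response_ids = tokenizer(response, return_tensors="pt", add_special_tokens=False).input_ids
    if prompt_ids is not None:
        input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    else:
        input_ids = response_ids
    labels = input_ids.clone()
    if prompt_ids is not None:
        labels[:, : prompt_ids.shape[1]] = -100  # compute loss only on response tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over unmasked tokens
    return torch.exp(loss).item()

def ifd_score(model, tokenizer, instruction: str, response: str) -> float:
    """IFD(a|q) = PPL(a|q) / PPL(a), as in Eq. 2."""
    ppl_cond = response_perplexity(model, tokenizer, instruction, response)
    ppl_uncond = response_perplexity(model, tokenizer, "", response)
    return ppl_cond / ppl_uncond

# Example usage (model name is an assumption):
# tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
# lm = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
# print(ifd_score(lm, tok, "Reverse a string.", "def rev(s):\n    return s[::-1]"))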
III. APPROACH

A. Definitions

Data Selection Task. Given a vast pool of instruction tuning data D = {x_1, x_2, ..., x_n}, where each x_i represents an individual instruction-response pair (q_i, a_i), our objective is to select a subset S_\pi^{(m)} of size m from D, employing a selection strategy \pi. To evaluate the effectiveness of the selected subset, we represent the alignment performance after instruction tuning as Q [10]. The optimal selection strategy \pi^* within a given data budget m is then defined as:

\pi^* = \arg\max_{\pi} Q(S_\pi^{(m)}).   (3)

In general, the alignment performance, as indicated by Q, is determined by the test data. When the test data remains consistent, the optimal selection strategy \pi^* should be developed based on quantifiable indicators within the training data. We propose that data complexity and dataset diversity serve as these indicators and aim to quantify them.

Complexity. Complexity measures the difficulty of a single instruction-response pair, where a higher complexity signifies a greater learning value. Complex programming problems typically require the integration of multiple domains of knowledge and skills, necessitating more sophisticated reasoning and a higher level of detail than simpler problems. We introduce the IFD score as a measure of complexity.
Fig. 1. An overview of our proposed CDAS method, including three steps from top to bottom. Step 1: Clustering the EVOL-Instruct dataset to form multiple clusters. Step 2: Computing the Instruction-Following Difficulty score by comparing the model's perplexity with and without instructions. Step 3: Sampling the top m% instances from each re-ranked cluster to form a high-complexity sub-dataset that preserves data diversity. Finally, we use the selected data for fine-tuning to obtain CodeACT-Coder.

The IFD score quantifies the extent to which an instruction facilitates the generation of a corresponding response by comparing the model's loss or perplexity. A higher IFD score indicates that the sample likely involves more complex knowledge or uncommon combinations, thereby revealing the problem's intricacy. Conversely, a lower IFD score suggests that the sample is simpler and contains information highly consistent with the model's pre-trained knowledge.

More importantly, the IFD score serves as a purely statistical measure that can reflect different models' performance on the same dataset. Different models will yield different IFD score distributions, indicating their varied definitions and understandings of complex data. Therefore, IFD demonstrates the model's adaptability, enabling an objective evaluation of their performance across tasks of varying complexity. This characteristic makes the IFD score an effective tool for systematically identifying and selecting complex programming problems.

Diversity. Diversity measures the richness of the entire dataset's samples, with higher diversity indicating a more varied and comprehensive dataset. While complexity is crucial, ensuring data diversity is equally important for enabling the model to perform well across various scenarios [10]–[13]. In this paper, we define diversity as the range occupied by the probability distribution of all samples in the dataset D within the semantic space. Although this distribution is unknown and prevents direct calculation of diversity, the dataset's diversity is fixed once created. Thus, in the task of data selection, we aim to maintain the subset's diversity consistent with the original dataset's, as represented by the formula:

\lim_{|S| \to \infty} P_S(x) = P_D(x),   (4)

where |S| denotes the size of subset S, and P_S(x) and P_D(x) are the probability distributions of S and D, respectively.

The most straightforward approach to satisfy (4) is through random sampling. Nonetheless, (4) imposes a weak constraint on subset sampling, becoming effective only when |S| approaches infinity. For smaller |S|, it fails to guarantee that the distribution of S in the semantic space aligns with that of D. A more effective strategy involves partitioning D into multiple sub-datasets D_i, from which we sample subsets S_i, ensuring S adheres to:

\lim_{|S_i| \to \infty} P_{S_i}(x) = P_{D_i}(x) \quad \text{for } i = 1, 2, \dots, k,   (5)

where |S_i| represents the size of S_i, and P_{S_i}(x) and P_{D_i}(x) are the probability distributions of S_i and D_i, respectively.

Equation (5) provides stronger constraints compared to (4), ensuring that the distribution range of S in semantic space is more consistent with D when the number of samples is reduced. Based on the above analysis, we propose using K-Means to partition the dataset D into sub-datasets D_i to ensure that the diversity of the sampled subset S is consistent with that of the original dataset D. To validate our choice of K-Means, we conducted a comparative study with other methods in Section V-C.
B. Complexity and Diversity Aware Sampling

To address the challenges in selecting optimal code data, we propose the Complexity and Diversity Aware Sampling (CDAS) method. This approach is rooted in the necessity of considering both the complexity and diversity of data to enhance model training efficiency and effectiveness. By integrating these two aspects, CDAS aims to improve the generalization capabilities of LLMs in the programming domain.

Algorithm 1 Complexity and Diversity Aware Sampling
Require: sampling proportion m%, the whole dataset D = {(q_1, a_1), (q_2, a_2), ..., (q_n, a_n)}
Ensure: Sampled dataset S
1: Derive embeddings from instructions in D
2: Partition D into k clusters based on embeddings
3: for each cluster C_k do
4:     for each instance (q_i, a_i) in C_k do
5:         Compute IFD score for instance (q_i, a_i)
6:     end for
7:     Reorder instances in C_k based on IFD scores
8:     Sample the top m% of instances from C_k
9: end for
10: Combine sampled instances from all clusters to form S
11: return S

As shown in Algorithm 1, CDAS begins by utilizing lightweight sentence transformers [21] to derive embeddings from the given instructions. These embeddings effectively capture the semantic essence of the instructions, facilitating more accurate clustering. Subsequently, the dataset is partitioned into distinct clusters, which ensures data diversity by aggregating similar instances. Within each cluster, the IFD score is computed for each instance, with the aim of identifying complex programming data. After that, the data within each cluster is reordered, and the top m% of instances from each cluster is sampled. This final sampling step guarantees that the selected data maintains a balanced representation of complexity and diversity, thereby enhancing the model's ability to generalize across various coding tasks.

By considering both diversity and complexity dimensions of data, CDAS offers a promising solution to the challenge of selecting optimal instruction tuning data for LLMs in the programming domain. This method paves the way for more effective and efficient model training, potentially leading to significant improvements in code generation and understanding capabilities.
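The following Python sketch mirrors Algorithm 1: embed the instructions with a lightweight sentence transformer, partition them with K-Means, re-rank every cluster by IFD score, and keep the top m% of each cluster. The embedding model, cluster count, and dataset schema are assumptions, and the IFD scores are expected to be precomputed, for example with the sketch in Section II-B.

import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def cdas_select(dataset, ifd_scores, m=0.4, k=100, seed=0):
    """Sketch of Algorithm 1 (CDAS).

    dataset:    list of {"instruction": str, "response": str} dicts (assumed schema).
    ifd_scores: one IFD score per example, e.g. from the Section II-B sketch.
    """
    # Step 1: embed instructions and partition them into k clusters.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed lightweight encoder
    embeddings = embedder.encode([ex["instruction"] for ex in dataset])
    clusters = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)

    ifd_scores = np.asarray(ifd_scores)
    selected = []
    for c in range(k):
        idx = np.where(clusters == c)[0]
        # Step 2: re-rank each cluster by IFD score, most complex first.
        idx = idx[np.argsort(-ifd_scores[idx])]
        # Step 3: keep the top m% of every cluster, preserving cluster proportions.
        selected.extend(idx[: max(1, int(round(m * len(idx))))].tolist())
    return [dataset[i] for i in selected]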
C. Padding Strategy

Tokenization is a crucial step in the pre-training and fine-tuning processes of LLMs. The primary task of tokenization is to segment the text into smaller units for processing. This step not only effectively defines how data is utilized during training but also plays a pivotal role in enhancing the model's effectiveness and training efficiency. However, due to varying sample lengths, the traditional padding strategy typically aligns samples to the model's maximum input length by using additional padding tokens, as shown at the top of Figure 2. This often results in a high proportion of padding tokens, which reduces training efficiency and increases computational resource consumption.

To address this issue, the dynamic padding strategy has been proposed and widely adopted. In this strategy, the maximum input length is determined by the longest sample in each batch, with shorter samples being padded to match this length, as illustrated in the middle of Figure 2. It effectively reduces the number of padding tokens used, thereby significantly accelerating the training process.

To further optimize the utilization of maximum input length and reduce the number of padding tokens, we propose the Dynamic Pack strategy. This strategy first sorts the samples within a batch by length and then attempts to concatenate multiple samples into a single data instance, as shown at the bottom of Figure 2. This process results in a new batch of samples, which are then padded based on the maximum length of the new batch. The Dynamic Pack strategy not only enhances training efficiency but also further optimizes the use of computational resources.
Fig. 2. Illustration of different padding strategies, where the blank squares represent padding tokens. Top: Traditional padding strategy aligns samples to the model's maximum input length, resulting in high computational resource consumption. Middle: Dynamic padding strategy reduces the number of padding tokens by aligning samples to the length of the longest sample in each batch. Bottom: Our proposed Dynamic Pack strategy sorts samples by length and concatenates multiple samples within a batch, further optimizing the utilization of the model's maximum input length and reducing padding tokens.
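To make the Dynamic Pack description concrete, the sketch below sorts one batch of tokenized samples by length, greedily concatenates neighbours while the packed sequence stays within the model's maximum input length, and pads only to the longest packed sequence. The greedy policy is an assumption about one reasonable realization; handling of labels, cross-sample attention masking, and position ids is omitted for brevity.

def dynamic_pack(batch, max_len, pad_id):
    """Pack a batch of token-id lists into fewer sequences, then pad to the batch max.

    batch: list[list[int]] with the tokenized samples of one batch.
    """
    packed = []
    for sample in sorted(batch, key=len):            # Step 1: sort by length
        if packed and len(packed[-1]) + len(sample) <= max_len:
            packed[-1] = packed[-1] + list(sample)   # Step 2: merge while it still fits
        else:
            packed.append(list(sample))
    longest = max(len(seq) for seq in packed)        # Step 3: pad only to the new batch max
    attention = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in packed]
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in packed]
    return input_ids, attention

# A batch with lengths [4096, 300, 200, 120] packs into two sequences instead of four,
# so far fewer padding tokens are materialized than with traditional or dynamic padding.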

D. Code Adaptive Compute-efficient Tuning Framework

The Code Adaptive Compute-efficient Tuning (CodeACT) framework is meticulously crafted to refine the training process of LLMs through the integration of a sophisticated data selection method and an innovative token padding strategy. The overarching objective of CodeACT is to amplify the efficiency and efficacy of model training, particularly for intricate tasks such as code generation.

CodeACT harnesses the CDAS method for code data selection. CDAS ensures that the selected data is both diverse and complex, thereby facilitating the creation of a robust training dataset that bolsters the model's generalization capabilities. Additionally, CodeACT introduces the Dynamic Pack padding strategy. This innovative strategy overcomes the inefficiencies of traditional and dynamic padding by sorting samples by length and concatenating them without surpassing the model's maximum input length. The Dynamic Pack strategy significantly reduces the number of padding tokens, optimizing computational resources and expediting the training process.

By seamlessly integrating CDAS and Dynamic Pack, CodeACT offers a holistic solution for enhancing the training efficiency and performance of LLMs in complex tasks. This framework not only elevates model performance in code generation but also lays the foundation for more effective and resource-efficient training processes.

IV. EXPERIMENTAL SETUP

A. Datasets

EVOL-Instruct. The EVOL-Instruct [6] dataset is derived from the iterative evolution of the Code Alpaca [5] dataset, where instruction complexity is incrementally increased using ChatGPT with evolution prompts. These prompts encompass five distinct aspects, including the imposition of constraints, the substitution of broad requirements with more detailed ones, the extension of reasoning steps, the inclusion of deceptive code, and the enhancement of time or space complexity. Each instruction undergoes multiple iterations of evolution, during which pruning and post-processing are performed to eliminate undesirable instructions and responses. This iterative complexity augmentation method produces instructions of higher quality and depth compared to traditional Alpaca methods. We utilize the Evol-Instruct-Code-80K² dataset, an open-source implementation comprising approximately 80K samples.

² https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1

OSS-Instruct. The OSS-Instruct [7] dataset leverages ChatGPT to generate programming problems and their corresponding solutions. The generation process is controlled by using real code snippets sourced from open-source codebases like GitHub as seeds. This approach is distinctive because it provides real-world code snippets as inspiration, prompting the language model to generate problems that closely reflect actual programming scenarios. This not only ensures the diversity and authenticity of the generated problems but also captures the various challenges encountered in real-world programming.
TABLE I. Performance comparison of our framework across different datasets and models. The CodeACT column indicates whether the model was trained using our framework. The bold scores represent the best performance achieved using the same base model. The results highlight the efficiency gains achieved by CodeACT in terms of reduced training time and peak GPU memory usage, while maintaining or improving performance across various benchmarks. Benchmark columns report Pass@1 (%).

Model            Size   CodeACT   Training Time   Peak GPU Memory   HumanEval   HumanEval+   MBPP   MBPP+

Models trained on OSS-Instruct dataset
CodeLlama        7B     No        220 min         64.28 GB          50.6        47.0         63.2   51.4
CodeLlama        7B     Yes       63 min          33.21 GB          54.3        50.0         60.4   50.4
CodeLlama        13B    No        367 min         69.17 GB          58.5        52.4         63.2   51.9
CodeLlama        13B    Yes       109 min         59.62 GB          59.8        52.4         63.2   50.6
DeepSeek-Coder   6.7B   No        184 min         64.48 GB          65.2        61.0         75.9   63.4
DeepSeek-Coder   6.7B   Yes       54 min          35.51 GB          68.3        61.6         75.9   61.7

Models trained on EVOL-Instruct dataset
CodeLlama        7B     No        297 min         50.65 GB          54.3        50.0         60.7   48.6
CodeLlama        7B     Yes       68 min          38.25 GB          53.0        47.0         60.7   49.9
CodeLlama        13B    No        468 min         69.17 GB          62.2        56.7         63.2   52.9
CodeLlama        13B    Yes       116 min         58.82 GB          64.0        55.5         62.4   51.6
DeepSeek-Coder   6.7B   No        259 min         50.75 GB          58.5        53.7         71.4   58.1
DeepSeek-Coder   6.7B   Yes       58 min          37.23 GB          67.1        59.8         69.9   58.1

B. Benchmarks

We employ four code benchmarks: HumanEval [22], HumanEval+ [23], MBPP [24], and MBPP+ [23]. Consistent with previous research [7], [23], [25], we use greedy decoding to generate a single sample for each benchmark and LLM, focusing our comparison on the pass@1 metric.

HumanEval/HumanEval+. HumanEval and its enhanced counterpart, HumanEval+, serve as critical benchmarks for evaluating the code generation capabilities of LLMs. HumanEval comprises 164 manually-written Python problems, each accompanied by an average of 9.6 test cases. HumanEval+ builds upon this by significantly increasing the number of test cases through the use of LLMs and mutation strategies, resulting in a more rigorous evaluation framework.

MBPP/MBPP+. The MBPP (Mostly Basic Python Programming) benchmark includes approximately 1,000 Python challenges, crowd-sourced to assess fundamental programming skills and standard library usage. These challenges are geared towards beginners and each provides a description, a solution, and three tests to verify solution accuracy. MBPP+ extends the MBPP benchmark by incorporating a subset of hand-verified problems from the MBPP-sanitized dataset, ensuring that the tasks are well-defined and unambiguous, thereby enhancing the benchmark's reliability and applicability in more rigorous evaluations.
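Because a single greedy sample is generated per problem, pass@1 here reduces to the fraction of problems whose sample passes all of its test cases. The sketch below spells this out; generate_one and run_tests are hypothetical placeholders for the model's greedy decoder and a sandboxed test executor (e.g., the EvalPlus harness).

def pass_at_1(problems, generate_one, run_tests) -> float:
    """pass@1 with one greedy sample per problem.

    generate_one(prompt) -> str            (hypothetical: greedy completion from the LLM)
    run_tests(problem, completion) -> bool (hypothetical: runs the benchmark's test cases)
    """
    passed = 0
    for problem in problems:
        completion = generate_one(problem["prompt"])
        passed += bool(run_tests(problem, completion))
    return passed / len(problems)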
C. Implementation Details

We employ three base models for our experiments, including DeepSeek-Coder-Base-6.7B, CodeLlama-Python-7B, and CodeLlama-Python-13B. All models are fine-tuned for 3 epochs using eight NVIDIA A100-80GB GPUs through the Fully Sharded Data Parallel (FSDP) module within PyTorch. The training parameters are consistent across all models, with the exception of the maximum input length for the CodeLlama-Python-13B model. Specifically, we use AdamW [26] as our optimizer with a learning rate of 5e-5, a Cosine learning rate scheduler, and 15 warmup steps. The maximum sequence length is set to 4096 for both CodeLlama-Python-7B and DeepSeek-Coder-Base-6.7B, while it is set to 2048 for CodeLlama-Python-13B. The global batch size for all experiments is set to 512. We implement a full-parameter tuning approach throughout the training process.
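A minimal sketch of the optimizer and schedule described above is shown below, assuming PyTorch's AdamW and the Transformers cosine schedule with 15 warmup steps; the total step count is a placeholder, and FSDP wrapping, the packing collator, and the distributed launch are omitted.

import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model, total_steps, lr=5e-5, warmup_steps=15):
    # AdamW with a cosine decay schedule, matching the hyperparameters above.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    return optimizer, scheduler

# Remaining settings from the setup above: 3 epochs, global batch size 512, max length
# 4096 (2048 for the 13B model), full-parameter tuning with FSDP on eight A100-80GB GPUs.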
V. EXPERIMENTAL RESULTS

In this section, we present and analyze the findings by addressing five specific research questions.

A. RQ1: How does the CodeACT framework perform across different datasets and models?

We evaluate the performance and efficiency of the CodeACT framework by training models on two distinct datasets, OSS-Instruct and EVOL-Instruct. The focus is on assessing the framework's impact on CodeLlama and DeepSeek-Coder models, using only 40% of the available training data to highlight efficiency gains without compromising model accuracy.

The results presented in Table I demonstrate the effectiveness of the CodeACT framework in enhancing both efficiency and performance metrics across various models and datasets. For the OSS-Instruct dataset, the CodeACT framework significantly reduces training times and peak GPU memory usage for both CodeLlama and DeepSeek-Coder models. Specifically, the CodeLlama-7B model trained with CodeACT shows a 71% reduction in training time (from 220 minutes to 63 minutes) and a 48% decrease in peak GPU memory usage (from 64.28 GB to 33.21 GB), while also achieving improved performance on the HumanEval and HumanEval+ benchmarks. Similarly, the CodeLlama-13B model exhibits a 70% reduction in training time and a 14% reduction in peak GPU memory usage, with enhanced performance on the HumanEval benchmark. The DeepSeek-Coder-6.7B model, despite its smaller size, also benefits from the CodeACT framework, achieving a 71% reduction in training time and a 45% reduction in peak GPU memory usage, with the highest scores on most benchmarks.
TABLE II. Performance comparison of models trained with CodeACT to other models. The bold scores indicate the highest performance among models of the same size. The results show that models trained with CodeACT outperform their base models and achieve competitive results compared to other state-of-the-art open-source models. This underscores the effectiveness of the CodeACT framework in optimizing model performance and efficiency. Benchmark columns report Pass@1 (%).

Model                 Size   Base Model       Data Type       Data Num   HumanEval   HumanEval+   MBPP   MBPP+

Closed-source Models
Gemini-Pro-1.0        -      -                Proprietary     -          63.4        55.5         75.4   61.4
Claude-3-Opus         -      -                Proprietary     -          82.9        77.4         89.4   73.3
GPT-4-Turbo           -      -                Proprietary     -          85.4        81.7         85.7   73.3

Open-source Models
CodeLlama             34B    Llama 2          Proprietary     -          51.8        43.9         65.4   52.6
WizardCoder-CL        34B    CodeLlama        EVOL-Instruct   78K        73.2        64.6         73.2   59.9
StarCoder             15B    StarCoderBase    Proprietary     -          34.1        29.3         55.1   46.1
CodeLlama             13B    Llama 2          Proprietary     -          43.3        36.6         57.6   46.9
WizardCoder-SC        15B    StarCoder        EVOL-Instruct   78K        56.7        50.6         59.6   48.1
CodeACT-CL (ours)     13B    CodeLlama        EVOL-Instruct   31K        64.0        55.5         62.4   51.6
CodeLlama             7B     Llama 2          Proprietary     -          39.0        34.1         58.1   46.1
DeepSeek-Coder-Base   6.7B   -                Proprietary     -          47.6        40.2         69.2   54.6
WizardCoder-CL        7B     CodeLlama        EVOL-Instruct   78K        50.6        45.1         58.5   49.5
CodeACT-CL (ours)     7B     CodeLlama        EVOL-Instruct   31K        53.0        47.0         60.7   49.9
Magicoder-DS          6.7B   DeepSeek-Coder   OSS-Instruct    75K        66.5        60.4         75.4   61.9
CodeACT-DS (ours)     6.7B   DeepSeek-Coder   OSS-Instruct    30K        68.3        61.6         75.9   61.7

For the EVOL-Instruct dataset, the CodeACT framework continues to exhibit efficiency gains. The CodeLlama-7B model trained with CodeACT reduces training time by 77% and peak GPU memory usage by 24%, with performance improvements on the MBPP+ benchmark. The CodeLlama-13B model achieves a 75% reduction in training time and a 15% reduction in peak GPU memory usage, along with performance gains on the HumanEval benchmark. The DeepSeek-Coder-6.7B model achieves a 78% reduction in training time and a 27% reduction in peak GPU memory usage, with notable performance improvements on the HumanEval and HumanEval+ benchmarks.

These results suggest that the CodeACT framework not only enhances training efficiency but also maintains or improves model performance across different datasets and models. This indicates that the framework's ability to leverage a smaller subset of training data effectively contributes to both computational savings and performance optimization.

B. RQ2: How does the performance of models trained with CodeACT compare to other models?

We evaluate the performance of models trained with the CodeACT framework relative to other state-of-the-art models, including both closed-source and open-source variants. Table II presents a detailed comparison across various benchmarks.

The closed-source models, such as Claude-3-Opus and GPT-4-Turbo, demonstrate superior performance across most benchmarks, with GPT-4-Turbo achieving the highest scores on HumanEval and HumanEval+ (85.4% and 81.7%, respectively). Claude-3-Opus leads on the MBPP and MBPP+ benchmarks (89.4% and 73.3%, respectively). Despite these impressive results, our focus is primarily on comparing the performance of open-source models with those trained with the CodeACT framework.

Our results show that models fine-tuned with CodeACT exhibit substantial performance improvements compared to their respective base models. For instance, our CodeACT-CL-13B achieves notable gains, scoring 64.0% on HumanEval and 55.5% on HumanEval+, significantly outperforming its base model, CodeLlama-13B. Similarly, our CodeACT-DS-6.7B also excels by achieving 68.3% on HumanEval and 61.6% on HumanEval+, surpassing the performance of other open-source models with similar or larger parameter sizes, including Magicoder-DS-6.7B and WizardCoder-SC-15B.

Additionally, our models demonstrate impressive efficiency in data utilization. Despite using fewer data samples, models trained with CodeACT achieve better performance compared to models that utilize a larger dataset. For example, the CodeACT-DS-6.7B model, trained on 30K samples, outperforms Magicoder-DS-6.7B, which was trained on 75K samples. This highlights the efficacy of the CDAS method in selecting high-quality, influential data that enhances model performance. Moreover, the performance of CodeACT-DS-6.7B rivals that of closed-source models like Gemini-Pro-1.0, demonstrating that the gap between open-source and closed-source models can be significantly reduced using our framework.

Overall, the experimental results highlight CodeACT's capability to significantly enhance Code LLMs. By providing an efficient and effective training methodology through optimized data selection and training processes, CodeACT enables open-source models to achieve or even surpass the performance of larger, more resource-intensive models.
TABLE III. This table presents a comparison of various diverse data selection algorithms and their impact on the performance of the DeepSeek-Coder-Base-6.7B model using the OSS-Instruct dataset.

Method            Sampling Time   HumanEval   HumanEval+
K-Center Greedy   209 min         58.5%       54.3%
Graph Density     36 min          63.6%       57.9%
K-Means           0.2 min         63.4%       57.9%

C. RQ3: Why use K-Means for selecting diverse data?

To investigate why K-Means is preferred for selecting diverse data in the CDAS method, we conduct experiments using the DeepSeek-Coder-Base-6.7B model on the OSS-Instruct dataset. In this study, we compare three algorithms: K-Center Greedy [27], Graph Density [28], and K-Means. All algorithms are evaluated with a sampling rate set to 40%. The results are summarized in Table III.

The results indicate that K-Means achieves a near-optimal balance between sampling efficiency and model performance. Specifically, K-Means requires only 0.2 minutes for sampling, significantly lower than both K-Center Greedy (209 minutes) and Graph Density (36 minutes). This remarkable efficiency can be attributed to K-Means' algorithmic simplicity and its ability to converge quickly. Despite its rapid sampling time, K-Means maintains competitive performance with 63.4% and 57.9% pass rates on HumanEval and HumanEval+, respectively, which are on par with the outcomes yielded by the more time-intensive Graph Density algorithm.

Both K-Center Greedy and Graph Density demonstrate limitations in efficiency compared to K-Means. K-Center Greedy, while producing reasonable results with a 58.5% pass rate on HumanEval and 54.3% on HumanEval+, is extremely inefficient with a sampling time of 209 minutes. This inefficiency stems from its iterative nature, continually seeking the point that maximizes the minimum distance to any already chosen point. Graph Density, though more efficient than K-Center Greedy, still requires 36 minutes for sampling. It achieves the highest pass rate of 63.6% on HumanEval and ties with K-Means at 57.9% on HumanEval+, but the marginal performance gains do not justify the significantly longer sampling time. Both methods become computationally demanding as dataset size increases, with K-Center Greedy's approach becoming prohibitively slow and Graph Density requiring pairwise distance calculations that scale quadratically with the number of data points. These characteristics limit their applicability in real-world scenarios where time constraints and large-scale datasets are common.

In conclusion, the superior efficiency and comparable performance of K-Means make it the optimal choice for selecting diverse data. Its simplicity and effectiveness in clustering data points based on similarity ensure a diverse dataset without the computational overhead associated with the other algorithms. This comprehensive approach enables the model to generalize better and perform more effectively on coding tasks, validating K-Means as a superior sampling strategy for optimizing model training.

Fig. 3. Comparison of sampling rates and their impact on the performance of the DeepSeek-Coder-Base-6.7B model using the OSS-Instruct dataset.

D. RQ4: How should the sampling rate for CDAS be set?

To determine the optimal sampling rate for the CDAS method, we conduct experiments using the DeepSeek-Coder-Base-6.7B model on the OSS-Instruct dataset. The results, shown in Figure 3, compare the performance of different sampling rates (from 10% to 60%) against the baseline (100% data) on the HumanEval and HumanEval+ benchmarks.

The baseline performance, with full data, achieves 65.2% on HumanEval and 61.0% on HumanEval+. When reducing the data to 10%, the performance drops significantly to 61.0% on HumanEval and 55.5% on HumanEval+. Increasing the sampling rate to 20% improves the performance to 62.8% on HumanEval and 57.3% on HumanEval+. At a 30% sampling rate, the performance further improves to 65.9% on HumanEval and 59.1% on HumanEval+. Notably, with a sampling rate of 40%, the model achieves the highest performance, surpassing the baseline with scores of 68.3% on HumanEval and 61.6% on HumanEval+.

However, increasing the sampling rate beyond 40% results in a decline in performance. At a 50% sampling rate, the model's performance decreases to 65.9% on HumanEval and 59.1% on HumanEval+. Similarly, a 60% sampling rate sees further declines to 65.2% on HumanEval. These results suggest that overly large sampling rates may include more redundant or less informative data, which does not contribute to and may even hinder model performance.

The experimental results indicate that a sampling rate of 40% not only maintains the model's performance but actually improves it compared to using the entire dataset. This improvement can be attributed to the CDAS method's ability to effectively select diverse and complex data, which enhances the model's generalization capabilities and training efficiency. Additionally, using 40% of the data significantly reduces the computational resources and training time required, making it a more efficient choice.

In conclusion, based on the observed performance gains and efficiency improvements, a 40% sampling rate is determined to be the optimal setting for the CDAS method. This rate provides a balanced trade-off between maintaining high model performance and reducing computational overhead, thereby validating its selection as the final sampling rate for our experiments.
E. RQ5: How does CDAS compare to other sampling methods in terms of performance?

To investigate the effectiveness of different sampling methods, we conduct experiments using the DeepSeek-Coder-Base-6.7B model on the OSS-Instruct dataset. The sampling techniques examined include random sampling, complexity sampling, diversity sampling, and our innovative CDAS method. Complexity sampling involves selecting the top m% of data based on IFD scores. Diversity sampling employs K-Means clustering to randomly extract m% of data from each cluster. All strategies are evaluated with a sampling rate set to 40%. The results are summarized in Figure 4.

Fig. 4. Comparison of sampling methods and their impact on the performance of the DeepSeek-Coder-Base-6.7B model using the OSS-Instruct dataset.

The results demonstrate that CDAS outperforms the other methods across all benchmarks, achieving the highest scores. Notably, the performance improvement of CDAS over other sampling methods highlights the effectiveness of integrating both complexity and diversity in the data selection process. Random sampling and diversity sampling yield relatively lower performance, indicating that simply ensuring data variety without considering complexity is insufficient. Similarly, complexity sampling alone does not achieve optimal results, underscoring the importance of a balanced approach that considers both the complexity and diversity of training samples.

By considering both complexity and diversity, CDAS ensures that the selected data is not only challenging but also representative of a broad range of programming scenarios. This comprehensive approach enables the model to generalize better and perform more effectively on diverse coding tasks, thus validating CDAS as a superior sampling method for optimizing model training. The synergistic effect of combining complexity and diversity in CDAS leads to a more robust and versatile model, capable of handling a wide array of programming challenges with improved accuracy and efficiency.

VI. DISCUSSION

A. Threats to Validity

Scope of Model Sizes. The current study has primarily evaluated the CodeACT framework on models with 7B and 13B parameters. While the results demonstrate significant improvements in both performance and efficiency, it is crucial to validate the framework's effectiveness on larger-scale models. Testing CodeACT on models with greater parameter sizes, such as those with 30B or more parameters, would provide deeper insights into its scalability and generalizability. Evaluating the framework on larger models will also help ascertain whether the observed benefits in smaller models hold true in more complex architectures.

Scope of Code-Related Tasks. The current study has focused exclusively on code generation tasks. However, the efficacy of the CodeACT framework should extend to a wider array of code-related tasks, such as bug fixing [29]–[32] and code summarization [33]–[35]. Future research should aim to investigate the applicability and effectiveness of CodeACT in these additional domains to provide a more comprehensive assessment of its capabilities. Evaluating CodeACT on bug fixing tasks would examine its capacity to identify and correct errors in code, thereby enhancing code reliability. Code summarization tasks, which involve generating concise descriptions of code functionality, would further demonstrate the framework's potential to aid in code comprehension and documentation. Extending the evaluation to these diverse tasks would offer a holistic view of CodeACT's utility in the broader software engineering landscape.

B. Limitation of CodeACT

While the CodeACT framework has demonstrated effectiveness in selecting code data by considering both complexity and diversity, there are inherent limitations that need to be addressed. One significant limitation is the challenge of ensuring the correctness of complex code. Although the CDAS algorithm has proven to be effective, selecting complex data often brings about issues related to the correctness and reliability of the code samples. Complex code, by its nature, is more prone to errors and inconsistencies, which can adversely affect the quality of training and the performance of the resulting models.

Ensuring the correctness of complex code data is a crucial aspect that CodeACT currently does not fully address. The inclusion of erroneous or unreliable code can lead to suboptimal training outcomes; mitigating this could involve incorporating automated testing, static analysis, or leveraging additional LLMs to verify the accuracy and functionality of the complex code samples before they are used for training.
By addressing this limitation, we aim to enhance the robustness and reliability of the CodeACT framework. Ensuring the correctness of complex code will not only improve the quality of the training data but also contribute to the overall effectiveness of the model in real-world coding tasks. This aspect will be a key focus in our future research efforts.

VII. RELATED WORK

A. Base Code LLMs

The advancement of LLMs has significantly impacted various domains, including code generation and understanding. Closed-source models, such as GPT-4 [8], have consistently ranked highly on mainstream evaluation metrics, demonstrating superior performance in code-related tasks. These models leverage extensive resources and proprietary data, resulting in a performance gap between them and their open-source counterparts.

To bridge this gap and democratize access to advanced coding capabilities, several open-source models have been developed. Notable among these are CodeLlama [2], DeepSeek-Coder [3], and CodeGemma [36]. CodeLlama, derived from the Llama 2 [37] architecture, is specifically fine-tuned for code generation tasks and has shown competitive results compared to closed-source models. These open-source models have significantly propelled the field of code generation forward. They offer robust alternatives to closed-source models, promoting innovation and collaboration within the research community. The continuous development and refinement of these models are crucial for narrowing the performance disparity and advancing the state-of-the-art in code-related applications.

B. Data Generation

A significant area of research focuses on generating instructional data to fine-tune base LLMs. Self-Instruct [38] is one such method that refines weaker student models by using strong teacher models to generate synthetic instructions. This approach leverages the expertise of advanced LLMs to produce diverse and complex instructional data, which in turn helps in training more effective student models. Evol-Instruct [6] takes this a step further by iteratively increasing the complexity of the coding instructions. This method involves evolving instructions to be more challenging over multiple iterations, thereby improving the model's ability to handle complex tasks. Another notable approach is OSS-Instruct [7], which generates realistic coding problems based on open-source code snippets. By extracting real-world code segments from repositories such as GitHub, OSS-Instruct prompts LLMs to create relevant and practical coding challenges, ensuring the generated data closely mirrors actual programming scenarios.

These methodologies typically utilize more powerful LLMs, such as GPT-4, to generate vast amounts of synthetic data. While these approaches address the issue of data quantity, they often overlook data quality. Ensuring the relevance, diversity, and complexity of the generated data remains a critical challenge for further enhancing model performance.

C. Data Selection

The process of manual data curation is not only costly but also susceptible to subjective bias, rendering the development of automated data selection methods critically important. Current automated data selection methods are primarily divided into two categories: those that rely on external models for data selection and those that do not.

In the realm of methods dependent on external models, for instance, AlpaGasus [39] utilizes meticulously crafted prompt templates to leverage ChatGPT for scoring data quality. InsTag [40] employs ChatGPT to obtain detailed labels for each data instruction, assessing the complexity and diversity of the data based on these labels. LIFT [41] generates a diverse set of instructions using GPT-4 to augment the dataset, followed by vectorization and selection of subsets based on row variables, ultimately utilizing GPT-4 for multi-dimensional scoring of the data. While these methods have demonstrated efficacy in handling large-scale datasets, they also incur significant economic costs.

In contrast, methods independent of external models, such as DQ [42], integrate techniques of data distillation and coreset selection [43]. The core technology involves defining a gain function to iteratively partition the dataset and select representative samples, thereby maximizing data diversity. Cherry LLM [20] introduces the Instruction-Following Difficulty score, determined by comparing the cross-entropy loss of model-generated responses with and without instructions. A high IFD score implies that the model struggles to accurately align answers with instructions, reflecting the complexity of the instructions.

Despite the progress made in automated data curation, methods specifically tailored for code data selection remain notably absent from the literature.

VIII. CONCLUSION

In this paper, we introduce the CodeACT framework, designed to optimize the training of Code LLMs by addressing both data quality and computational efficiency. CodeACT integrates the CDAS method for selecting complex and diverse data, and the Dynamic Pack padding strategy to minimize padding tokens and reduce resource consumption. Our experimental results demonstrate that CodeACT-DeepSeek-Coder-6.7B, fine-tuned on only 40% of the EVOL-Instruct data, achieves a significant performance increase on HumanEval by 8.6%, a reduction in training time by 78%, and a decrease in peak GPU memory usage by 27%. These findings validate the effectiveness of the CodeACT framework in improving both the performance and efficiency of Code LLMs.

Future work should focus on ensuring the correctness of complex code data. While CDAS effectively selects influential data, it will be crucial to integrate mechanisms for validating and improving the accuracy of complex instructions. This will further strengthen the CodeACT framework, making it more effective in handling diverse and complex coding tasks.
REFERENCES

[1] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[2] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., "Code llama: Open foundation models for code," arXiv preprint arXiv:2308.12950, 2023.
[3] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Li et al., "Deepseek-coder: When the large language model meets programming – the rise of code intelligence," arXiv preprint arXiv:2401.14196, 2024.
[4] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei et al., "Starcoder 2 and the stack v2: The next generation," arXiv preprint arXiv:2402.19173, 2024.
[5] S. Chaudhary, "Code alpaca: An instruction-following llama model for code generation," https://github.com/sahil280114/codealpaca, 2023.
[6] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang, "Wizardcoder: Empowering code large language models with evol-instruct," in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=UnUwSIgK5W
[7] Y. Wei, Z. Wang, J. Liu, Y. Ding, and L. Zhang, "Magicoder: Source code is all you need," arXiv preprint arXiv:2312.02120, 2023.
[8] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[9] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu et al., "Lima: Less is more for alignment," Advances in Neural Information Processing Systems, vol. 36, 2024.
[10] W. Liu, W. Zeng, K. He, Y. Jiang, and J. He, "What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning," in The Twelfth International Conference on Learning Representations, 2024.
[11] A. Bukharin and T. Zhao, "Data diversity matters for robust instruction tuning," arXiv preprint arXiv:2311.14736, 2023.
[12] X. Ni, Y. Gong, Z. Gou, Y. Shen, Y. Yang, N. Duan, and W. Chen, "Exploring the mystery of influential data for mathematical reasoning," arXiv preprint arXiv:2404.01067, 2024.
[13] Y. Ge, Y. Liu, C. Hu, W. Meng, S. Tao, X. Zhao, H. Ma, L. Zhang, H. Yang, and T. Xiao, "Clustering and ranking: Diversity-preserved instruction selection through expert-aligned quality estimation," arXiv preprint arXiv:2402.18191, 2024.
[14] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, E. Pathak, G. Karamanolakis, H. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuznia, K. Doshi, K. K. Pal, M. Patel, M. Moradshahi, M. Parmar, M. Purohit, N. Varshney, P. R. Kaza, P. Verma, R. S. Puri, R. Karia, S. Doshi, S. K. Sampat, S. Mishra, S. Reddy A, S. Patro, T. Dixit, and X. Shen, "Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 5085–5109. [Online]. Available: https://aclanthology.org/2022.emnlp-main.340
[15] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei et al., "The flan collection: Designing data and methods for effective instruction tuning," in International Conference on Machine Learning. PMLR, 2023, pp. 22631–22648.
[16] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, "Stanford alpaca: An instruction-following llama model," https://github.com/tatsu-lab/stanford_alpaca, 2023.
[17] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, "Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality," March 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/
[18] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, Q. Lin, and D. Jiang, "WizardLM: Empowering large pre-trained language models to follow complex instructions," in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=CfXh93NDgH
[19] M. Li, L. Chen, J. Chen, S. He, H. Huang, J. Gu, and T. Zhou, "Reflection-tuning: Data recycling improves llm instruction-tuning," arXiv preprint arXiv:2310.11716, 2023.
[20] M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen, N. Cheng, J. Wang, T. Zhou, and J. Xiao, "From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning," in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 7595–7628.
[21] N. Reimers and I. Gurevych, "Sentence-bert: Sentence embeddings using siamese bert-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. [Online]. Available: https://arxiv.org/abs/1908.10084
[22] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[23] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, "Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation," in Thirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=1qvx610Cu7
[24] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., "Program synthesis with large language models," arXiv preprint arXiv:2108.07732, 2021.
[25] X. Chen, M. Lin, N. Schärli, and D. Zhou, "Teaching large language models to self-debug," in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=KuPixIqPiq
[26] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7
[27] O. Sener and S. Savarese, "Active learning for convolutional neural networks: A core-set approach," in International Conference on Learning Representations, 2018.
[28] S. Ebert, M. Fritz, and B. Schiele, "Ralf: A reinforced active learning formulation for object class recognition," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3626–3633.
[29] R. Gupta, S. Pal, A. Kanade, and S. Shevade, "Deepfix: Fixing common c language errors by deep learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
[30] D. A. Tomassi, N. Dmeiri, Y. Wang, A. Bhowmick, Y.-C. Liu, P. T. Devanbu, B. Vasilescu, and C. Rubio-González, "Bugswarm: Mining and continuously growing a dataset of reproducible failures and fixes," in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 2019, pp. 339–349.
[31] W. Oh and H. Oh, "Pyter: effective program repair for python type errors," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2022. New York, NY, USA: Association for Computing Machinery, 2022, pp. 922–934. [Online]. Available: https://doi.org/10.1145/3540250.3549130
[32] R. Tian, Y. Ye, Y. Qin, X. Cong, Y. Lin, Z. Liu, and M. Sun, "Debugbench: Evaluating debugging capability of large language models," arXiv preprint arXiv:2401.04621, 2024.
[33] X. Hu, G. Li, X. Xia, D. Lo, S. Lu, and Z. Jin, "Summarizing source code with transferred api knowledge," in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), Stockholm, Sweden, 2018, pp. 13–19.
[34] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, "Codesearchnet challenge: Evaluating the state of semantic code search," arXiv preprint arXiv:1909.09436, 2019.
[35] N. Muennighoff, Q. Liu, A. Zebaze, Q. Zheng, B. Hui, T. Y. Zhuo, S. Singh, X. Tang, L. Von Werra, and S. Longpre, "Octopack: Instruction tuning code large language models," arXiv preprint arXiv:2308.07124, 2023.
[36] C. Team, "Codegemma: Open code models based on gemma," arXiv preprint arXiv:2406.11409, 2024.
[37] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[38] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, "Self-instruct: Aligning language models with self-generated instructions," arXiv preprint arXiv:2212.10560, 2022.
[39] L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Yadav, Z. Tang, V. Srinivasan, T. Zhou, H. Huang et al., "Alpagasus: Training a better alpaca with fewer data," in The Twelfth International Conference on Learning Representations, 2024.
[40] K. Lu, H. Yuan, Z. Yuan, R. Lin, J. Lin, C. Tan, C. Zhou, and J. Zhou, "#instag: Instruction tagging for analyzing supervised fine-tuning of large language models," in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=pszewhybU9
[41] Y. Xu, Y. Yao, Y. Huang, M. Qi, M. Wang, B. Gu, and N. Sundaresan, "Rethinking the instruction quality: Lift is what you need," arXiv preprint arXiv:2312.11508, 2023.
[42] D. Zhou, K. Wang, J. Gu, X. Peng, D. Lian, Y. Zhang, Y. You, and J. Feng, "Dataset quantization," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17205–17216.
[43] R. Iyer, N. Khargoankar, J. Bilmes, and H. Asanani, "Submodular combinatorial information measures with applications in machine learning," in Algorithmic Learning Theory. PMLR, 2021, pp. 722–754.
