Large Language Models As Tool Makers
Tianle Cai1,2∗ Xuezhi Wang1 Tengyu Ma1,3† Xinyun Chen1 Denny Zhou1
1 Google DeepMind  2 Princeton University  3 Stanford University
Abstract
1 Introduction
Large language models (LLMs) have demonstrated outstanding capabilities across a broad array of
NLP tasks [Brown et al., 2020, Chowdhery et al., 2022, Zhang et al., 2022, Hoffmann et al., 2022,
OpenAI, 2023, Google, 2023] and have even shown promising signs of achieving certain aspects
of artificial general intelligence [Bubeck et al., 2023, Kosinski, 2023]. Moreover, analogous to the
evolution of human intelligence, recent research has unveiled the potential of augmenting LLMs with
external tools, thereby significantly enhancing their problem-solving capacities and efficiencies [Yao
et al., 2023, Liu et al., 2023, Parisi et al., 2022, Schick et al., 2023].
However, the applicability of these tool-using methods is largely contingent on the availability of
suitable tools. A crucial turning point in the evolution of humans was gaining the ability to fabricate
their own tools to address emerging challenges. Inspired by the importance of tool-making for humans,
in this work we embark on an initial exploration of applying this evolutionary concept to the realm of
LLMs. We propose a closed-loop framework, termed LLMs As Tool Makers (LATM), that enables LLMs to
generate their own tools for problem solving.
∗ Work done as a Student Researcher at Google DeepMind.
† Work done as a Visiting Researcher at Google DeepMind.
Code available at https://github.com/ctlllll/LLM-ToolMaker.
2 Related Work
Chain of thought (CoT). Recently, significant progress has been made in enhancing the problem-
solving abilities of large language models (LLMs) for complex tasks. For instance, CoT prompt-
ing [Wei et al., 2022, Wang et al., 2022] has been proposed to bolster LLM reasoning capabilities,
demonstrating improved performance across various reasoning and natural language processing tasks.
CoT is typically articulated in natural language [Ling et al., 2017, Cobbe et al., 2021, Suzgun
et al., 2022, Shi et al., 2022, Zhou et al., 2022], yet it might also be effectively represented using
programming languages [Amini et al., 2019, Austin et al., 2021, Nye et al., 2021, Chowdhery et al.,
2022, Gao et al., 2023, Chen et al., 2022]. More recently, Arora et al. [2023] proposed using LLMs
to generate structured views over documents, balancing quality and cost by ensembling extractions
from multiple synthesized functions. Our method shares a similar spirit with Arora et al. [2023] in
managing cost and quality trade-offs but focuses on more general use cases.
Augmenting language models with tools. Recent works have explored the potential of using
external tools to supplement LLMs’ capabilities for complex tasks. Yao et al. [2023], Yang et al.
[2023] proposed augmenting reasoning traces with task-specific actions in LLMs, enabling models to
reason and act synergistically. Various studies [Liu et al., 2023, Parisi et al., 2022, Schick et al., 2023,
Shen et al., 2023, Lu et al., 2023, Paranjape et al., 2023, Liang et al., 2023] have demonstrated that
supplementing LLMs with tools, such as calculators, search engines, translation systems, calendars,
or even API calls on other models, can help solve tasks that are not easily addressed by LLMs alone.
Similar to LATM, methods like Chameleon [Lu et al., 2023] also incorporate Python executors in the
pipeline. However, their primary focus is on using Python executors to accurately solve sub-steps
involving arithmetic reasoning, similar to Gao et al. [2023], Chen et al. [2022]. In contrast, we
use Python executors to create reusable tools for addressing other task instances. Furthermore, the
separation of the tool maker and tool user enables the use of a lightweight model for most inferences,
thus enhancing efficiency and cost-effectiveness in LATM.
Adaptive generation in language models. In addition, recent research has proposed methods to
adaptively control decoding in LLMs to improve text generation efficiency [Leviathan et al., 2022,
Chen et al., 2023a, Xia et al., 2023]. Speculative decoding is based on the notion that generating
text tokens with a large model (a more expensive process) can be expedited by drafting with a faster yet
less powerful model, while the larger, costlier model is used only to score the drafted tokens (a much
faster process), thereby approximating its performance. Our approach of passing tools from a more expensive model to a smaller, faster
model also shares a similar spirit of adaptive computing. Instead of altering the decoding procedure,
we transfer newly generated tools between models to boost both the performance and efficiency of an
LLM in solving tasks.
Language model cascades. There is recent evidence that LLMs can enable repeated interactions
and that multiple LLMs can be combined to extend their capabilities further [Wu et al., 2022,
Zhou et al., 2022, Dohan et al., 2022, Chen et al., 2023c]. Also, Chen et al. [2023b] demonstrated
that identifying optimal LLM combinations can help reduce costs while improving accuracy. Our
motivation aligns with these findings; however, rather than merely cascading LLMs, we identify task
categories that can be better addressed using new tools generated by a larger model and assign each
individual inference within that task category to a smaller model.
In the LATM paradigm, the main process can be split into two stages: Tool Making and Tool Using.
Each stage utilizes different types of Large Language Models (LLMs) to balance performance and
cost-effectiveness. All the prompts used in our experiments are shown in Appendix B.
Tool Making. This stage employs a powerful yet more expensive model, such as GPT-4, to serve
as the tool maker. The tool maker's role is to create a generic and reusable tool (implemented as a Python
function) from a few demonstrations of a task. This stage can be further divided into three sub-stages, with a condensed code sketch following the list:
Figure 2: The pipeline of LATM. LATM can be divided into two stages: 1) tool making: a powerful
yet more expensive model serves as the tool maker to generate generic and reusable tools from a
few demonstrations; 2) tool using: a lightweight and cheaper model serves as the tool user to apply
the tool to various instances of the task. The tool-making stage can be further divided into
three sub-stages: (i) tool proposing: the tool maker attempts to generate the tool (a Python
function) from a few training demonstrations; if the tool is not executable, it reports the error and
generates a new one (fixing the issues in the function); (ii) tool verification: the tool maker runs unit tests
on validation samples; if the tool does not pass the tests, it reports the error and generates new tests (fixing
the issues in the function calls in the unit tests); and (iii) tool wrapping: wrapping up the function code and
the demonstrations of how to convert a question into a function call from the unit tests, producing usable
tools for the tool user.
• Tool Proposing: In this stage, the tool maker attempts to generate a Python function to solve the
demonstrations from the given task. This process follows the “programming by example” (PbE)
paradigm [Halbert, 1984], in which several concrete demonstrations are provided and the model is
required to write programs that reproduce the demonstrated behaviors. In our experiments, we use
3 demonstrations for this stage. If the proposed tool is not executable or encounters errors, the tool
maker appends the error messages to the history and makes another attempt.
• Tool Verification: In this stage, the tool maker generates unit tests using validation samples and
subsequently executes these tests on the proposed tool. We utilize 3 validation samples in our
experiments. If the tool fails any of these tests, the tool maker records the error in its history
and makes an attempt to rectify the issues within the unit tests (this procedure will only correct
the function calls in the unit test part and will not correct the function). The ability of LLMs
to self-debug has been demonstrated effectively in recent research [Madaan et al., 2023, Chen
et al., 2023c, Lu et al., 2023]. However, within the LATM pipeline, the verification stage serves a
slightly different purpose. This stage fulfills two key roles: 1) it provides examples that demonstrate
how to convert natural language questions into function calls, and 2) it verifies the tool’s reliability,
enabling the entire process to be fully automated.
• Tool Wrapping: If the execution or verification fails more than a preset number of times, the tool-making
stage is considered to have failed. Otherwise, the tool maker prepares the wrapped tool for the tool user.
This step involves wrapping up the function code and providing demonstrations of how to convert
a task into a function call. These demonstrations are extracted from the Tool Verification step,
which converts questions into unit tests. This final product is then ready for use by the tool user.
Please see Appendix C for examples of the wrapped tools.
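For concreteness, the following is a minimal sketch of how these three sub-stages can be composed. It is illustrative rather than the exact implementation: `chat` is a hypothetical helper that wraps an LLM API call, and the prompt strings abbreviate the full prompts in Appendix B.

```python
def make_tool(train_demos, valid_demos, chat, max_retries=3):
    """Condensed sketch: propose -> verify -> wrap. `chat` is a hypothetical LLM wrapper."""
    history = [f"Write a generic Python function that solves these examples:\n{train_demos}"]
    # Tool proposing: retry until the generated function at least executes.
    for _ in range(max_retries):
        code = chat(model="gpt-4", messages=history)
        try:
            exec(code, {})
            break
        except Exception as err:
            history += [code, f"Execution error: {err}. Please fix the function."]
    else:
        return None  # proposing failed within the retry budget

    # Tool verification: generate unit tests (function calls) from validation samples.
    # On failure, only the calls in the tests are regenerated, not the function itself.
    history += [code, f"Write unit tests (function calls) for these examples:\n{valid_demos}"]
    for _ in range(max_retries):
        tests = chat(model="gpt-4", messages=history)
        try:
            exec(code + "\n" + tests, {})
            # Tool wrapping: ship the verified function plus call demonstrations to the tool user.
            return code + "\n\nUse cases:\n" + tests
        except Exception as err:
            history += [tests, f"Test error: {err}. Please fix the function calls."]
    return None  # verification failed within the retry budget
```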
Tool Using. This second stage employs a lightweight and cost-effective model, such as GPT-3.5
Turbo, to serve as the tool user. The tool user’s role is to utilize the verified tool to solve various
instances of the task. The prompt for this stage is the wrapped tool which contains the function for
solving the task and demonstrations of how to convert a task query into a function call. With the
demonstrations, the tool user can then generate the required function call in an in-context learning fashion.
The function calls are then executed to solve the task. Optionally, postprocessing can be applied
to convert the output to match the required format of the task, such as options for multiple-choice
questions.
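The tool-using step can be pictured with the short sketch below. It is a simplification that assumes the same hypothetical `chat` helper as above and assumes the generated solution stores its answer in a variable named "ans", as the wrapped-tool prompt in Appendix B instructs.

```python
def use_tool(tool_code, call_demos, question, chat):
    # The prompt is simply the wrapped tool: the function code plus demonstrations of
    # how to turn a question into a function call (`chat` is a hypothetical LLM wrapper).
    prompt = f"{tool_code}\n\nUse cases:\n{call_demos}\n\nQuestion: {question}\nSolution:"
    call_code = chat(model="gpt-3.5-turbo", prompt=prompt)  # in-context function-call generation
    namespace = {}
    exec(tool_code + "\n" + call_code, namespace)           # execute the tool with the generated call
    return namespace.get("ans")                             # by convention the answer is saved in "ans"
```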
The tool-making stage, including tool proposing, verification, and wrapping, only needs to be
performed once for each type of task. The resulting tools can then be reused for all instances of
that task. This makes LATM significantly more efficient and cost-effective than using a powerful
model alone. Furthermore, the Python function tools are a more generic form of Chain-of-Thought,
enhancing the overall utility and flexibility of the LLMs, as they can be used to solve questions that
involve algorithmic reasoning ability [Veličković and Blundell, 2021].
To illustrate our methodology, Figure 3 provides a concrete example of how the tool maker solves
the logical deduction task from BigBench [Srivastava et al., 2022] by producing a tool (a Python
function), and how the tool user utilizes the tool. This task requires inferring the ordering of five
objects and then answering a question. The conditions include both relative positions of certain object
pairs and the absolute positions of some objects, as demonstrated in the “Tool maker input” block
in Figure 3. To solve this task, the tool maker, e.g., GPT-4, generates a generic program that solves
the task by extracting constraints from the question and then searching over all permutations for the
result. The tool user, e.g., GPT-3.5 Turbo, can then utilize this program to solve the task, using a
function call that merely extracts relevant information from the natural language instances of the task.
We show more examples of the generated new tools for solving other tasks in Appendix C.
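For concreteness, a permutation-search tool of the kind described above might look like the following sketch. The function name `find_order` and its body are our illustration rather than the verbatim GPT-4 output; the constraint format mirrors the use case shown in Appendix C.

```python
from itertools import permutations

def find_order(objects, constraints):
    # Enumerate all orderings of the objects and return the first one that satisfies
    # every extracted constraint (each constraint is a predicate over the ordering).
    for order in permutations(objects):
        if all(constraint(list(order)) for constraint in constraints):
            return list(order)
    return None

# Usage mirroring the Appendix C use case: constraints extracted from the question.
objects = ["white", "green", "brown", "gray", "orange"]
constraints = [
    lambda order: order.index("gray") > order.index("orange"),
    lambda order: order.index("green") == len(order) - 2,
    lambda order: order.index("brown") > order.index("white"),
    lambda order: order.index("brown") < order.index("orange"),
]
print(find_order(objects, constraints))
```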
In real-world scenarios, task instances typically arrive in sequence. To accommodate this stream of
data, we introduce a third LLM, the dispatcher, which determines whether to engage the tool user
or tool maker for each incoming task. This module bears similarities to the tool selection feature
present in existing works [Lu et al., 2023, Shen et al., 2023, Schick et al., 2023, Paranjape et al.,
2023]. However, our dispatcher is distinct in its ability to identify new tasks that cannot be addressed
by existing tools and to engage the tool maker to generate new tools for these tasks.
Specifically, the dispatcher maintains a record of existing tools produced by the tool maker. When a
new task instance is received, the dispatcher initially determines if there is a suitable tool for the task
at hand. If a suitable tool exists, the dispatcher passes the instance and its corresponding tool to the
tool user for task resolution. If no appropriate tool is found, the dispatcher identifies the instance
as a new task and solves the instance with a powerful model or even invokes a human labeler. The
instances from a new task are then cached until sufficient cached instances are available for the tool
maker to make a new tool. The dispatcher’s workflow is illustrated in Figure 4. Given the simplicity
of the dispatching task, the dispatcher can be a lightweight model equipped with proper prompts (See
Appendix B), which adds only a marginal cost to the overall pipeline.
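A minimal sketch of this dispatching logic is given below; the `chat` helper, the prompt wording, and the caching threshold are illustrative assumptions rather than the exact implementation.

```python
def dispatch(instance, existing_tools, cache, chat, min_cached=3):
    # Ask a lightweight model whether any existing tool fits this instance
    # (the prompt would list task examples associated with each existing tool).
    task = chat(model="gpt-3.5-turbo",
                prompt=f"Existing tools: {list(existing_tools)}\nInstance: {instance}\n"
                       "Which task does this belong to? Answer 'unknown' if none fits.")
    if task in existing_tools:
        return ("tool_user", existing_tools[task])      # solve with the matched tool
    # Otherwise treat it as a new task: solve it with a powerful model (or a human labeler)
    # and cache it until enough instances exist to trigger tool making.
    cache.setdefault("new_task", []).append(instance)
    if len(cache["new_task"]) >= min_cached:
        return ("tool_maker", cache.pop("new_task"))    # enough demonstrations: make a new tool
    return ("powerful_model", instance)
```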
4 Experiments
4.1 Experimental Setup
Datasets. We evaluate our approach on six datasets from diverse domains, including Logical
Deduction, Tracking Shuffled Objects, Dyck Language, Word Sorting, Chinese Remainder Theorem,
and Scheduling Meeting. The first five datasets are sourced from BigBench [Srivastava et al., 2022].
We use the five-object versions of the Logical Deduction and Tracking Shuffled Objects tasks, referred
to as Logical Deduction (5) and Tracking Shuffled Objects (5) in this paper. We also constructed the
Scheduling Meeting task to demonstrate the effectiveness of LATM in real-world scenarios.
Figure 3: An illustration of the Tool Proposing and Tool Using stages of the LATM pipeline
for the Logical Deduction task [Srivastava et al., 2022]. This task requires determining the order
of five objects based on several given conditions. In the Tool Proposing stage, the tool maker (such
as GPT-4) formulates a generic Python function capable of solving the provided k demonstrations
from the task (where k equals 3 in our experiments). The tool maker generates a search algorithm
that enumerates all possible orderings and verifies each against the provided conditions. During the
tool-using stage, the tool user translates each natural language question into a series of conditions,
generating function calls to utilize the tool for each task instance.
Table 1: The utility functions generated by the tool maker to solve the tasks.
Detailed information on dataset generation can be found in Appendix D. We divide each dataset into training,
validation, and test sets, containing 3, 3, and 240 instances, respectively.
Model settings. During the tool-making stage, we set the temperature to 0.3 to introduce ran-
domness to the generation process, allowing for retries if necessary. For this stage, we conduct
experiments using GPT-4 and GPT-3.5 Turbo models with the ChatCompletion API, always ap-
pending the response to the chat history to create an interactive experience. In the tool-using stage,
the LLM API call is made only once, and we also perform ablation studies on GPT-3-type models
with the standard Completion API. When using the tools, we consistently set the temperature to 0.0.
We set the maximum number of retries to 3 for the tool-proposing and tool-verification stages.
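As a concrete illustration of these settings, the calls below use the legacy openai-python ChatCompletion interface mentioned above; the helper functions and the message bookkeeping are our own simplification rather than the exact experimental code.

```python
import openai  # legacy (pre-1.0) openai-python interface, matching the ChatCompletion API above

def tool_maker_turn(history, temperature=0.3):
    # One interactive turn of the tool-making stage: sample with temperature 0.3 and append
    # the reply to the running chat history so that retries can see earlier errors.
    response = openai.ChatCompletion.create(
        model="gpt-4", messages=history, temperature=temperature)
    message = response["choices"][0]["message"]
    history.append({"role": message["role"], "content": message["content"]})
    return message["content"]

def tool_user_call(prompt):
    # Tool using is a single call with temperature 0.0 (greedy decoding).
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0)
    return response["choices"][0]["message"]["content"]
```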
In the tool-making stage, we use a powerful yet slower model to generate generic Python functions
tailored to a specific task. This step is performed only once for each task, and the overhead is
amortized across all instances of that task. In our experiments, we use GPT-4 [OpenAI, 2023] as a
representative tool maker, while we explore other models’ tool-making capabilities in Section 4.5.
We provide several few-shot exemplars for the language model, guiding it to generate generic Python
programs, as illustrated in Figure 3.
Figure 4: An illustration of the Dispatcher. In an online setting where task instances arrive
sequentially, the dispatcher, a lightweight model, assesses each incoming instance. If a suitable tool
already exists to tackle the task, the dispatcher selects this tool and forwards the task instance to the
tool user for resolution. If no suitable tool is found, the dispatcher routes the task instance to the tool
maker to create a new tool that can be used by the tool user later.
Our observations indicate that when GPT-4 is employed as the tool maker, the model frequently
devises suitable algorithms for solving tasks. For instance, as shown in Table 1, the tool maker
creates code to solve the logical deduction task by searching through all permutations and selecting
the correct one that satisfies the given constraints. In our experiment, the tool-verification stage is
mainly used to provide examples that demonstrate how to convert natural language questions into
function calls; we observe only 2 cases out of the 60 trials in which the tool maker corrects its
mistakes with the guidance of error messages. See Section 4.5 for further discussion of the tool maker.
In Table 2, we compare the performance of Chain-of-Thought prompting [Wei et al., 2022] with our
method, LATM. We employ GPT-4 as the tool maker to generate tools for the six tasks, and evaluate
the performance of both GPT-3.5 Turbo and GPT-4 as tool user. The results demonstrate that with the
help of the tool, a lightweight model like GPT-3.5 Turbo can achieve performance on par with GPT-4,
significantly outperforming CoT prompting. Additionally, the average cost of using GPT-3.5 Turbo
with the tool is much lower compared to using GPT-4. This highlights the effectiveness of LATM
in enhancing the performance of lightweight models and therefore reducing the cost compared to
employing expensive models. Intriguingly, for the Dyck Language task, GPT-3.5 Turbo as the tool
user even surpasses GPT-4 in the same role. Upon investigating the failure cases, we find
that when converting the question into a function call, GPT-4 occasionally solves part of the problem
unnecessarily, which leads to incorrect function output.
Tool User Model | Method | Logical Deduction (5) | Tracking Shuffled Objects (5) | Dyck Language | Word Sorting | Chinese Remainder Theorem | Schedule Meeting | Cost on n samples
GPT-3.5 Turbo | CoT | 66.4 | 61.6 | 20.4 | 59.2 | 0.0 | 18.9 | O(nc)
GPT-3.5 Turbo | LATM | 79.7 (+13.3) | 99.6 (+38.0) | 92.2 (+71.8) | 98.3 (+39.1) | 100.0 (+100.0) | 100.0 (+81.1) | O(nc + C)
GPT-4 | CoT | 88.8 | 100.0 | 63.6 | 90.9 | 0.0 | 55.6 | O(nC)
GPT-4 | LATM | 86.6 | 100.0 | 87.5 | 99.1 | 100.0 | 100.0 | O(nC)
Table 2: Performance comparison between LATM and Chain-of-Thought (CoT) [Wei et al.,
2022]. The six tasks are detailed in Section 4.1. For LATM, the tool is created by GPT-4 and
utilized by both GPT-3.5 Turbo and GPT-4. The results demonstrate that the application of LATM
can significantly enhance the performance of GPT-3.5 Turbo, often surpassing or matching GPT-4’s
performance with CoT in certain scenarios. The last column depicts the overall cost of processing n
samples. Here, C represents the cost of one call to GPT-4, while c denotes the cost of one call to
GPT-3.5 Turbo. At the time of writing this paper, C is over 15x larger than c. The few-shot CoT
demonstrations for the first four tasks are provided by Suzgun et al. [2022], while for the last two
tasks, we apply direct few-shot prompting without CoT.
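As a rough illustration of the cost column (not an exact accounting), the snippet below assumes C ≈ 15c and a handful of GPT-4 calls for tool making; the number of tool-making calls is an assumption.

```python
# Rough illustration of the asymptotic costs in Table 2 (not an exact accounting).
n, c = 240, 1.0              # number of test instances; unit cost of one GPT-3.5 Turbo call
C = 15 * c                   # a GPT-4 call is ~15x more expensive at the time of writing
tool_making_calls = 5        # assumed handful of GPT-4 calls to propose/verify/wrap one tool

cot_with_gpt4 = n * C                       # O(nC): every instance is handled by GPT-4
latm = n * c + tool_making_calls * C        # O(nc + C): the GPT-4 cost is amortized per task
print(cot_with_gpt4, latm)                  # 3600.0 vs. 315.0 cost units
```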
Tool Maker Model | Logical Deduction (5) | Tracking Shuffled Objects (5) | Dyck Language | Word Sorting | Chinese Remainder Theorem | Schedule Meeting
GPT-3.5 Turbo | 0/5 | 0/5 | 5/5 | 5/5 | 5/5 | 0/5
GPT-4 | 3/5 | 4/5 | 5/5 | 5/5 | 5/5 | 3/5
Table 3: Success rate of generating new tools (Python functions that pass the tool-verification
step) in the tool-making stage with GPT-4 vs. GPT-3.5 Turbo. We run 5 trials for each model on
each task; n/5 means that n out of 5 trials succeeded in producing a valid tool. For hard tasks like Logical
Deduction and Tracking Shuffled Objects, GPT-3.5 Turbo fails in all trials, showing the necessity of
using a more powerful model as the tool maker.
As mentioned in Section 3.2, we can extend LATM to a streaming setting where instances from
(potentially) different tasks arrive on-the-fly. In this case, we require another model, the dispatcher,
to determine the task to which the instance belongs. We use GPT-3.5 Turbo as the dispatcher and
evaluate its ability to: 1) identify existing tools to solve an incoming instance; 2) request tool-making
for instances from an unseen task.
Identifying existing tools. We first assess the ability of the dispatcher to identify existing tools
for a given instance. We randomly mix the six tasks from Section 4.1 and generate a test set with
100 samples. For each instance in the test set, we use the dispatcher to identify the appropriate
existing tool with the prompt that contains task examples associated with existing tools, as shown in
Appendix B. If the tool is identified correctly, we consider it a success. The accuracy of determining
the correct tool is 94% ± 2% over five random constructions of the test set.
Requesting tool-making. Next, we evaluate the dispatcher’s ability to request tool-making for
instances from an unseen task. We randomly select four tasks as existing tasks with tools ready. We
then pick four tasks for testing: two are unseen, and two are within the existing tasks. We generate
a test set with 100 samples. For each instance in the test set, we use the dispatcher to determine
whether it needs to request tool-making or if the instance can be solved by an existing tool. The
accuracy of making the correct request is 95% ± 4%.
The results demonstrate that the dispatcher can effectively identify existing tools and request tool-
making for unseen tasks without a significant performance drop. This suggests that LATM can be
smoothly extended to a streaming setting with a mixture of tasks.
Capacity required for the tool-making language model. We investigate the capacity requirements
for the language model used in the tool-making stage. Generally, we found that a more powerful
and expensive model better serves the purpose, as this stage is performed only once for each task,
and high accuracy is crucial for effectively passing tools to a smaller model. Specifically, on hard
tasks like Logical Deduction and Tracking Shuffled Objects, GPT-3.5 Turbo fails in all 5 trials;
the main failure mode is that the generated tool is not general enough and may only work on the training
samples. On the other hand, we also discovered that for easy tasks, the tool maker can be a lightweight
language model. For simple tasks like Word Sorting, GPT-3.5 Turbo can effortlessly generate a
program that solves the task. Another limitation that may contribute to the tool maker's failures is the
context length constraint: since we use the entire history at each step of tool making to enhance the
reliability of the tool-making stage, the context grows longer, and in this setting GPT-4, with its
8192-token context length, is preferable.
Capacity required for the tool-using language model. In this section, we investigate the capacity
requirements for the tool-using model. The results are presented in Table 4. We observed that
GPT-3.5 Turbo offers the best balance between performance and cost among all the models tested.
Regarding the older GPT-3 series of models (ada, babbage, curie, davinci), we found that models
before instruction tuning often perform better than their instruction-tuned counterparts. We
Task | GPT-3.5 Turbo | text-davinci-002 | davinci | curie | babbage | ada
Logical Deduction (5) | 79.7% | 58.2% | 11.6% | 6.5% | 11.6% | 3.0%
Tracking Shuffled Objects (5) | 99.6% | 100.0% | 62.1% | 20.7% | 16.4% | 5.2%
Dyck Language | 92.2% | 35.8% | 16.4% | 18.1% | 9.1% | 9.9%
Word Sorting | 98.3% | 60.8% | 26.6% | 7.3% | 7.3% | 0.9%
Chinese Remainder Theorem | 100.0% | 100.0% | 99.6% | 93.1% | 75.0% | 66.0%
Schedule Meeting | 100.0% | 100.0% | 62.9% | 59.1% | 23.2% | 0.0%
Cost ($ per 1K tokens) | 0.002 | 0.02 | 0.02 | 0.002 | 0.0005 | 0.0004
Table 4: A performance comparison of various tool user models, all using the same tool generated
by GPT-4. All costs are based on the rates at the time of writing. Of all the models, GPT-3.5
Turbo demonstrates the best trade-off between performance and cost. We opted for GPT-3 models
prior to instruction tuning, as we observed that the models post instruction tuning underperformed
considerably in the tool-using stage. We postulate that this is due to the instruction tuning phase
impairing the in-context learning ability, which is essential for the tool-using stage.
hypothesize that the instruction tuning phase in these models may adversely impact the in-context
learning ability, which is crucial for the tool-using stage.
CoT as a tool does not help. In addition to LATM, we investigate whether we can improve task
performance by reusing a Chain-of-Thought (CoT) from a larger model in a smaller model, analogously to
the LATM pipeline. Specifically, we use the same larger model (GPT-4) in a “CoT-making” stage,
applying the zero-shot prompt “Let's think step by step.” to elicit the intermediate reasoning steps, and
then pass the generated CoT to the same smaller tool-using model (GPT-3.5 Turbo). We test this on
two tasks and report the results in Table 5. We observe that using CoT from a large model yields performance
similar to, or even worse than, human-written CoT, which is much worse than LATM.
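A rough sketch of this baseline, with a hypothetical `chat` helper standing in for the model calls and simplified prompt strings:

```python
def make_cot(task_examples, chat):
    # "CoT-making": elicit intermediate reasoning from the larger model once per task.
    prompt = f"{task_examples}\nLet's think step by step."
    return chat(model="gpt-4", prompt=prompt)

def solve_with_generated_cot(cot, question, chat):
    # Reuse the generated reasoning as the demonstration for the smaller model.
    prompt = f"{cot}\n\nQuestion: {question}\nAnswer:"
    return chat(model="gpt-3.5-turbo", prompt=prompt)
```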
We introduced LATM, a closed-loop framework empowering large language models (LLMs) to create
and utilize their own tools for diverse tasks. Our approach, inspired by humans' evolutionary strides
in tool creation, employs two key stages: Tool Making and Tool Using. This division of labor allows
us to harness the capabilities of advanced LLMs while significantly reducing computational costs.
Our experiments confirmed the efficacy of LATM across various complex tasks, demonstrating that
our framework performs comparably to resource-intensive models while being more cost-effective.
In addition, we show that adding another dispatcher LLM can further provide flexibility to our
framework, enabling on-the-fly tool creation and usage.
In our evaluation process, we identified a significant lack of high-quality datasets that authentically
represent daily human-computer interactions, including recurring tasks such as scheduling meetings
or booking flights over email or phone calls, in their raw natural language format. We anticipate
that our work will stimulate the research community to create such datasets, which could prove
instrumental in cultivating the next generation of AI systems. These systems, capable of generating
and applying their own tools, will be equipped to tackle complex tasks more effectively. An exciting
avenue for future research is enabling the tool maker to refine and upgrade existing tools to manage
new problem instances, much like in software development. This adaptability could further catalyze
the evolution of the AI ecosystem, unlocking a wealth of opportunities.
References
Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi.
Mathqa: Towards interpretable math word problem solving with operation-based formalisms.
arXiv preprint arXiv:1905.13319, 2019.
Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer,
and Christopher Ré. Language models enable simple systems for generating structured views of
heterogeneous data lakes, 2023.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan,
Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large
language models, 2021.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar,
Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence:
Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John
Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint
arXiv:2302.01318, 2023a.
Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while
reducing cost and improving performance, 2023b.
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting:
Disentangling computation from reasoning for numerical reasoning tasks, 2022.
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to
self-debug. arXiv preprint arXiv:2304.05128, 2023c.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:
Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve
math word problems. arXiv preprint arXiv:2110.14168, 2021.
David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes,
Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-dickstein, Kevin Murphy, and
Charles Sutton. Language model cascades, 2022.
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and
Graham Neubig. Pal: Program-aided language models, 2023.
Google. Palm 2 technical report, 2023. URL https://ai.google/static/documents/
palm2techreport.pdf.
Daniel Conrad Halbert. Programming by example. University of California, Berkeley, 1984.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al.
Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Michal Kosinski. Theory of mind may have spontaneously emerged in large language models. arXiv
preprint arXiv:2302.02083, 2023.
Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative
decoding. arXiv preprint arXiv:2211.17192, 2022.
Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji,
Shaoguang Mao, et al. TaskMatrix.AI: Completing tasks by connecting foundation models with
millions of apis. arXiv preprint arXiv:2303.16434, 2023.
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale gen-
eration: Learning to solve and explain algebraic word problems. In Proceedings of the 55th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pages 158–167, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:
10.18653/v1/P17-1015. URL https://aclanthology.org/P17-1015.
Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, Soroush Vosoughi, Claire Cui, Denny
Zhou, and Andrew M. Dai. Mind’s eye: Grounded language model reasoning through simulation.
In The Eleventh International Conference on Learning Representations, 2023. URL https:
//openreview.net/forum?id=4rXMRuoJlai.
Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu,
and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models.
arXiv preprint arXiv:2304.09842, 2023.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri
Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement
with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David
Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work:
Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114,
2021.
OpenAI. Gpt-4 technical report, 2023.
Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and
Marco Tulio Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models.
arXiv preprint arXiv:2303.09014, 2023.
Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models, 2022.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer,
Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to
use tools, 2023.
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt:
Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580,
2023.
Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi,
Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are mul-
tilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam
Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the
imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint
arXiv:2206.04615, 2022.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung,
Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks
and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
Petar Veličković and Charles Blundell. Neural algorithmic reasoning. Patterns, 2(7):100273, 2021.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency
improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny
Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint
arXiv:2201.11903, 2022.
Tongshuang Wu, Michael Terry, and Carrie Jun Cai. Ai chains: Transparent and controllable human-ai
interaction by chaining large language model prompts. In Proceedings of the 2022 CHI Conference
on Human Factors in Computing Systems, pages 1–22, 2022.
Heming Xia, Tao Ge, Si-Qing Chen, Furu Wei, and Zhifang Sui. Speculative decoding: Loss-
less speedup of autoregressive translation, 2023. URL https://openreview.net/forum?id=
H-VlwsYvVi.
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu,
Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning
and action. arXiv preprint arXiv:2303.11381, 2023.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan
Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International
Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=
WE_vluYUL-X.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher
Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language
models. arXiv preprint arXiv:2205.01068, 2022.
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans,
Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in
large language models. arXiv preprint arXiv:2205.10625, 2022.
A Broader Impact and Limitations
This paper explores the potential of enabling Large Language Models (LLMs) to create their own
tools, thus allowing them greater autonomy in developing their ecosystem. While this avenue of
research is promising, it also raises important ethical, safety, and control considerations that need to
be carefully addressed.
One of the most significant impacts of our work lies in the potential for LLMs to grow and achieve
unprecedented capabilities automatically. This could significantly enhance the range and complexity
of tasks these models can handle, potentially revolutionizing fields such as customer service, tech-
nical support, and even areas of research and development. It could lead to more efficient use of
computational resources and a reduction in human intervention, especially for routine or repetitive
tasks.
However, this newfound autonomy of LLMs is a double-edged sword. As we endow LLMs with
the ability to generate their own tools, we also create a scenario where the quality of the tools they
develop may not always meet the standards or expectations set by human developers. Without proper
safeguards, there’s a risk that these models could generate solutions that are suboptimal, incorrect, or
even potentially harmful. Furthermore, as LLMs become more autonomous, the potential for loss
of control increases. If these models are widely used without appropriate regulation, there could be
unforeseen consequences, potentially even leading to scenarios where humans lose control over the
AI systems.
In this study, we have not addressed these control and safety issues in depth, and our work has some
limitations. Our proposed framework, LLM As Tool Maker, while effective in the tested scenarios,
is still in its early stages of development. It is crucial to note that the real-world performance and
safety of the system may vary based on the complexity and nature of the tasks it is applied to.
Additionally, the evaluation and validation of the tools created by the tool maker in a real-world
setting is a challenge that needs to be addressed.
B LATM Prompts
Use cases:
Question: {question (including options)}
Solution:
```python
{parse the question into the arguments of the function}
{call the function and save the return value in a variable named "ret"}
{for multiple choice question, parse the options}
{convert the return value "ret" to the answer (if the question is a multiple choice question, convert to an option) and save it in a variable named "ans", otherwise}
```
Do this for all the questions in the verification step.
Dispatcher Prompt
Task: logical_deduction_five_objects
===
Task: word_sorting
===
Skip other tasks
C Wrapped Tools
Tool for Logical Deduction
```python
from itertools import permutations
Use cases:
```python
objects = ["white", "green", "brown", "gray", "orange"]
constraints = [
lambda order: order.index("gray") > order.index("orange"),
lambda order: order.index("green") == len(order) - 2,
lambda order: order.index("brown") > order.index("white"),
lambda order: order.index("brown") < order.index("orange")
]
Tool for Tracking Shuffled Objects
```python
def square_dance(initial_partners, switches):
    # Create a dictionary to store the current partners
    current_partners = dict(initial_partners)
    # (reconstructed) apply each switch by swapping the two participants' assignments
    for a, b in switches:
        current_partners[a], current_partners[b] = current_partners[b], current_partners[a]
    return current_partners
```
Use cases:
Question: Alice, Bob, Claire, Dave, and Eve are on the same team in a soccer match. At the start of the match, they are each assigned to a position: Alice is playing goalkeeper, Bob is playing left midfielder, Claire is playing right winger, Dave is playing striker, and Eve is playing center midfielder.
As the game progresses, pairs of players occasionally swap positions. First, Alice and Claire trade positions. Then, Alice and Bob trade positions. Then, Dave and Bob trade positions. Then, Bob and Eve trade positions. Finally, Dave and Eve trade positions. At the end of the match, Eve is playing
Options:
(A) goalkeeper
(B) left midfielder
(C) right winger
(D) striker
(E) center midfielder
Answer: (C)
Solution:
```python
initial_positions = [("Alice", "goalkeeper"), ("Bob", "left
,→ midfielder"), ("Claire", "right winger"), ("Dave", "striker"),
,→ ("Eve", "center midfielder")]
switches = [("Alice", "Claire"), ("Alice", "Bob"), ("Dave", "Bob"),
,→ ("Bob", "Eve"), ("Dave", "Eve")]
Tool for Dyck Language
```python
def complete_sequence(input_str):
    stack = []
    closing_map = {'(': ')', '[': ']', '<': '>', '{': '}'}
    # (reconstructed) push each opening bracket; pop when its matching closer appears
    for char in input_str:
        if char in closing_map:
            stack.append(char)
        elif char in closing_map.values():
            stack.pop()
    result = []
    while stack:
        result.append(closing_map[stack[-1]])
        stack.pop()
    return ''.join(result)
```
Use cases:
Question: Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: ([[[{}]]{<[<[{}]>]>}
Answer: ])
Solution:
```python
input_str = "([[[{}]]{<[<[{}]>]>}"
ret = complete_sequence(input_str)
ans = ret
```
Skip two more questions...
Tool for Word Sorting
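A minimal sketch of a sorting helper consistent with the use case below; the name `sort_words_alphabetically` is taken from the use case, while the one-line body is our assumption rather than the verbatim generated tool.

```python
def sort_words_alphabetically(words):
    # Return the words in alphabetical (lexicographic) order.
    return sorted(words)
```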
Use cases:
Solution:
```python
words1 = ["conference", "apparition", "ignore", "dutton", "layperson",
,→ "coupe", "superstitious", "westward", "turnoff", "messenger",
,→ "copra", "floruit", "primitive", "implement"]
ret1 = sort_words_alphabetically(words1)
ans1 = " ".join(ret1)
```
Skip two more questions...
Tool for Chinese Remainder Theorem
```python
def find_number(max_limit, divisors, remainders):
for num in range(max_limit + 1):
        if all((num - remainder) % divisor == 0 for divisor, remainder in zip(divisors, remainders)):
return num
return None
```
Use cases:
Solution:
```python
max_limit = 1188877
divisors = [41, 107, 271]
remainders = [17, 42, 260]
ret = find_number(max_limit, divisors, remainders)
ans = ret
```
Skip two more questions...
Tool for Schedule Meeting
```python
from datetime import datetime, timedelta
return None
```
Use cases:
Question: A and B want to schedule a 1-hour meeting together. A's availability: 12:00 - 12:30, 13:00 - 13:30, 14:30 - 15:30, 17:30 - 18:00. B's availability: 09:00 - 11:00, 12:00 - 12:30, 13:00 - 13:30, 15:30 - 16:30, 17:30 - 18:00. What time slot works best? (if multiple, choose the earliest one)
Answer: No time slot works.
Solution:
```python
a_availability = [('12:00', '12:30'), ('13:00', '13:30'), ('14:30', '15:30'), ('17:30', '18:00')]
b_availability = [('09:00', '11:00'), ('12:00', '12:30'), ('13:00', '13:30'), ('15:30', '16:30'), ('17:30', '18:00')]
meeting_duration = 60
D Dataset Construction
For the “schedule meeting” task, we use the following template to generate the dataset:
question_format = """A and B want to schedule a {interval}-hour meeting
,→ together.
A's availability: {A_availability}
21
B's availability: {B_availability}
What time slot works best? (if multiple, choose the earliest one)"""
where the interval is randomly sampled from {0.5, 1, 1.5}, and the availability of A and B is
randomly sampled from 8:00-18:00 with 30-minute granularity. The answer is computed by
taking the intersection of the two availability sets and then finding the earliest time slot that is at
least as long as the meeting duration. If there is no such time slot, we return “No time slot works.”.
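A minimal sketch of this answer computation; the helper names and the output format are our assumptions rather than the exact dataset-generation code.

```python
def to_minutes(t):
    h, m = map(int, t.split(":"))
    return h * 60 + m

def earliest_common_slot(a_slots, b_slots, duration_minutes):
    # Intersect each pair of availability intervals and keep the earliest intersection
    # that is at least as long as the requested meeting duration.
    best_start = None
    for a_start, a_end in a_slots:
        for b_start, b_end in b_slots:
            start = max(to_minutes(a_start), to_minutes(b_start))
            end = min(to_minutes(a_end), to_minutes(b_end))
            if end - start >= duration_minutes and (best_start is None or start < best_start):
                best_start = start
    if best_start is None:
        return "No time slot works."
    end = best_start + duration_minutes
    return f"{best_start // 60:02d}:{best_start % 60:02d} - {end // 60:02d}:{end % 60:02d}"

# Example mirroring the Appendix C use case: a 1-hour meeting with no common slot long enough.
a = [('12:00', '12:30'), ('13:00', '13:30'), ('14:30', '15:30'), ('17:30', '18:00')]
b = [('09:00', '11:00'), ('12:00', '12:30'), ('13:00', '13:30'), ('15:30', '16:30'), ('17:30', '18:00')]
print(earliest_common_slot(a, b, 60))  # -> "No time slot works."
```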