Learning to Use Tools via Cooperative and Interactive Agents

Zhengliang Shi1  Shen Gao1  Xiuyi Chen2  Lingyong Yan2  Haibo Shi2
Dawei Yin2  Zhumin Chen1  Pengjie Ren1  Suzan Verberne3  Zhaochun Ren3∗

1 Shandong University, Qingdao, China   2 Baidu Inc., Beijing, China   3 Leiden University, Leiden, The Netherlands
shizhl@mail.sdu.edu.cn, z.ren@liacs.leidenuniv.nl
∗ Corresponding author.

arXiv:2403.03031v2 [cs.CL] 26 May 2024

Abstract

Tool learning empowers large language models (LLMs) as agents to use external tools to extend their capability. Existing methods employ a single LLM-based agent to iteratively select and execute tools, thereafter incorporating the result into the next action prediction. However, they still suffer from potential performance degradation when addressing complex tasks due to: (1) the limitation of the inherent capability of a single LLM to perform diverse actions, and (2) the struggle to adaptively correct mistakes when the task fails. To mitigate these problems, we propose ConAgents, a Cooperative and interactive Agents framework, which modularizes the workflow of tool learning into Grounding, Execution, and Observing agents. We also introduce an iterative calibration (IterCali) method, enabling the agents to adapt themselves based on the feedback from the tool environment. Experiments conducted on three datasets demonstrate the superiority of our ConAgents (e.g., a 6-point improvement over the SOTA baseline). We further provide a fine-granularity analysis of the efficiency and consistency of our framework.

[Figure 1 depicts the question "Can you tell me what is the latest movie on December 24, 2023 directed by Lilly Wachowski?" being solved (a) by a single agent that performs (1) tool selection, (2) execution, and (3) result incorporation, skipping failed steps; and (b) by our method, where (1) grounding produces a tool-use instruction, (2) execution requests the tool server with generated arguments, and (3) observation extracts target values from the executed results via a code interpreter, with failure signals fed back for calibration.]

Figure 1: Comparison between the existing single-agent tool learning method (a) and our ConAgents (b).

1 Introduction

Although achieving remarkable performance on a broad range of natural language processing tasks (Wang et al., 2023c; Chang et al., 2023), large language models (LLMs) still encounter inherent limitations, such as out-of-date information (Chang et al., 2023; Mallen et al., 2023). To improve their proficiency in tackling concrete complex tasks, tool learning has been proposed to equip LLMs with various auxiliary resources, e.g., a search engine (Qin et al., 2023a; Nakano et al., 2021) or a calculator (Schick et al., 2023; Qin et al., 2023b). Previous studies empower LLMs as tool-use agents by training on a tool-use dataset (Yang et al., 2023a; Tang et al., 2023; Hao et al., 2023) or by using in-context demonstrations to prompt the LLMs to execute tools (Lu et al., 2023; Shen et al., 2023; Yang et al., 2023b). As shown in Figure 1(a), both methods employ agents to interleave multiple actions in a pre-defined order, where the agents iteratively select a proper tool, pass arguments for execution, and incorporate the execution results into the next action prediction until they derive the answer.

Despite this advancement, existing tool learning methods with a single agent typically encounter two challenges when tackling complex tasks. On the one hand, solving a task involves varied actions with significant differences (Qiao et al., 2024; Dziri et al., 2023). As shown in Figure 1, selection requires the commonsense reasoning of agents to match the most relevant tool with the input task, while execution requires the agent to complete appropriate arguments following the intricate tool documentation. However, a single agent typically suffers from a limited inherent capability to acquire all required actions (Qiao et al., 2024; Xu et al., 2024; Yin et al., 2023). On the other hand, most prior work tackles a task in a fixed pre-defined order (Yao et al., 2023; Prasad et al., 2023), directly shifting to the next step even though the current step fails. This
mechanism cannot support adaptively calibrating the exceptional errors that frequently occur during tool-use workflows (Zhuang et al., 2023; Wang et al., 2023b). For example, when failing to invoke a tool, we should enable agents to revise the arguments used to call the tool based on the exception information, instead of directly shifting to the next step with the error response of previous steps.

To address the above problems, we propose ConAgents, an LLM-based Cooperative and interactive Agents framework for tool learning tasks. ConAgents has two objectives to address the limitations of previous work: (1) modularize the workflow with different agents and enable the agents' cooperation to solve a complex task; and (2) enable agents to interact with tool environments, calibrating themselves adaptively.

For the first objective, ConAgents decomposes the overall workflow into three modules: Grounding, Execution, and Observing. As shown in Figure 1(b), each module is managed by an independent LLM-based agent. The grounding agent reasons over the input task of the user and grounds it into a tool-use instruction, specifying which tool to use and the target output. The generated instruction guides the execution agent to complete the required arguments for the selected tool and request data from tool servers. The observing agent addresses the problem of how to effectively incorporate the lengthy execution results into the task-solving trajectory. Different from current solutions, which rely on a pre-defined schema of the tool response to filter irrelevant information (Song et al., 2023), we propose a schema-free method. This method enables the observing agent to dynamically generate a function for extracting the target values, which are then sent to the grounding agent to generate an instruction for the next iteration. When solving a complex task, our framework operates in an iterative manner.

For the second objective, considering that exceptional errors frequently occur when the agents execute tools or extract target values by programming, we introduce the Iterative Calibration (IterCali) method to enable the agents to interact with the tool environment and iteratively calibrate themselves. Specifically, when failing to execute a tool, IterCali appends the response from the tool server to the context of the execution agent, prompting the execution agent to calibrate the generated arguments following the error messages in the response. In addition, IterCali enables the observing agent to revise the generated function used to extract target values based on the programming error raised by the interpreter. Our proposed IterCali method thus enables the agents to interact with the environment, e.g., tool servers and interpreters, learning from external feedback to calibrate themselves.

Experiments on three datasets demonstrate that our ConAgents outperforms the state-of-the-art baseline (a 6-point improvement in the Success Rate metric on average).

Our contributions are summarized as follows: (1) We propose ConAgents, a cooperative and interactive agents framework for tool learning tasks. ConAgents consists of the Grounding, Execution, and Observing agents, synergizing specialized agents to solve a complex task. (2) We propose an Iterative Calibration (IterCali) method, enabling agents to calibrate themselves utilizing the feedback from the tool environment when they fail to complete their actions. (3) Both automatic and human evaluation conducted on three datasets indicate the superiority of ConAgents over strong baselines.1

1 https://github.com/shizhl/ConAgents

2 Related Work

LLMs for tool learning. Enhancing LLMs with external tools has been proven a promising method for solving practical tasks (Qin et al., 2023b; Gao et al., 2023a; Jin et al., 2023). Previous works typically empower a tool-learning agent by supervised fine-tuning (Patil et al., 2023; Ye et al., 2024b; Gao et al., 2023b) or prompt learning (Wu et al., 2023; Cai et al., 2023). Specifically, the former trains LLMs on specialized tool-use datasets (Li et al., 2023; Huang et al., 2023; Ye et al., 2024a), teaching LLMs how to use tools from the data (Schick et al., 2023). The latter leverages the inherent in-context learning capability of LLMs to use various tools, where the demonstrations and usage are taken as the prompt (Paranjape et al., 2023; Kim et al., 2023). However, solving complex tasks with tools typically involves various actions, e.g., deciding which tools to use, what arguments to pass, and how to utilize the results (Schick et al., 2023; Huang et al., 2023). Therefore, compelling one single agent to learn all these abilities places even greater pressure on it (Yin et al., 2023; Prasad et al., 2023). In addition, as
the tasks become complex, LLM-based agents struggle to incorporate the lengthy task-solving trajectory to predict a correct action for the next step (Qiao et al., 2024; Chen et al., 2023). Our proposed ConAgents aims to modularize the overall workflow, synergizing specialized agents to solve a complex task.

Multi-agent cooperation. Synergizing multiple agents has demonstrated strong performance on a variety of tasks (Liu et al., 2023; Sun et al., 2023; Zhang et al., 2023) and has emerged as a promising approach to enhance the capabilities of individual agents (Talebirad and Nadiri, 2023; Mohtashami et al., 2023). Recent studies bring multiple agent instances into a debate for a fixed number of rounds (Wang et al., 2023a; Liang et al., 2023), boosting their factuality (Cohen et al., 2023) and reasoning capacities (Du et al., 2023; Fu et al., 2023). However, they neglect to decompose a complex task and reduce the workload for each agent (Yin et al., 2023). Other work addresses complex mathematical problems and code generation (Qian et al., 2023b,a) but cannot adaptively correct the actions of agents when tasks fail. In our work, we enable the agents to adapt themselves by interacting with the tool environment using external feedback.

3 Methodology

3.1 Overall Framework

Our cooperative and interactive framework ConAgents is proposed with two objectives: (1) enabling the cooperation of agents to solve complex tasks; and (2) enabling the agents to adapt their actions when encountering errors. To achieve the first objective, ConAgents modularizes the workflow of tool learning tasks into a grounding agent M_G, an execution agent M_E, and an observing agent M_O. As shown in Figure 2, ConAgents alternates these three agents for multiple iterations to solve a complex task Q. Specifically, M_G is first employed to decompose the task Q into a simpler sub-task and generate a tool-use instruction I. Then I, paired with the tool documentation D, instructs M_E to execute the selected tool. Since the execution results R can be lengthy, the observing agent is designed to extract the relevant value r from R. The value r is then incorporated into the context of the grounding agent M_G to predict the instruction for the next iteration. For the second objective, we introduce an Iterative Calibration (IterCali) method. IterCali enables these agents to utilize the feedback from the tool environment, calibrating themselves when they fail to complete their assigned tasks.

3.2 Grounding Agent

The grounding agent is designed to resolve the input task into a series of sub-steps and ground each step into a tool-use instruction. Specifically, at the i-th iteration, the grounding agent generates the instruction I_i conditioned on the task Q and the current trajectory H^G_i = {(I_j, r_j) | j < i}, which consists of the accumulated previous instructions I_{<i} and results r_{<i}. It can be formulated as:

    I_i = M_G(H^G_i, Q, T),    (1)

where T indicates the toolset, which consists of various tools. The instruction I_i specifies which tool to use and the target output to be extracted. For example, given a task like "What is the movie released on Dec 24, 2023?", the instruction can be "Use the tool <search_tv> to find a movie shown on Dec 24, 2023." The generated instruction is then used to instruct the execution agent to execute the selected tool.

3.3 Execution Agent

Following the generated instruction, the execution agent M_E executes the selected tool by completing the required arguments and requesting data from tool servers. Specifically, for the i-th step, the instruction I_i, paired with the documentation D_i of the selected tool, is taken as input by M_E to generate appropriate arguments a_i, which can be formulated as:

    a_i = M_E(D_i, I_i).

The arguments a_i are used to invoke the tool:

    R_i = Execute(a_i),

where Execute(∗) indicates requesting the data from the backend servers of tools, and R_i indicates the execution result. When the tool fails to execute successfully, R_i contains an error message as a failure signal. When the tool executes successfully, the result R_i contains the targeted information in response to the instruction I_i. However, since the result R_i can be lengthy, directly inserting R_i into the context of the agents may lead to hallucination (Shi et al., 2023) and typically exceeds the context length. Thus, the observing agent M_O is proposed to extract the relevant values.
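As a minimal sketch of this execute-and-signal behavior (the tool registry, tool body, and error format below are hypothetical stand-ins for a real tool server, not the paper's implementation):

```python
# Sketch of R_i = Execute(a_i): invoke a tool with generated arguments and
# surface either the raw result or an error message that the execution
# agent can later use as a failure signal for calibration.

def search_tv(date: str) -> dict:
    """A toy stand-in for a <search_tv> tool backend (hypothetical)."""
    if not date:
        raise ValueError("missing required argument: 'date'")
    return {"results": [{"title": "Example Show", "release_date": date}]}

TOOL_REGISTRY = {"search_tv": search_tv}

def execute(tool_name: str, arguments: dict) -> dict:
    """Request data from a (mocked) tool server.

    On success, returns {"ok": True, "result": ...}; on failure, returns
    {"ok": False, "error": ...} so the error message can be appended to
    the execution agent's calibration history.
    """
    try:
        result = TOOL_REGISTRY[tool_name](**arguments)
        return {"ok": True, "result": result}
    except Exception as exc:  # the failure signal carried in R_i
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}

success = execute("search_tv", {"date": "2023-12-24"})
failure = execute("search_tv", {"date": ""})
```

A real deployment would replace the registry lookup with an HTTP request to the tool's backend; the point here is only that success and failure share one return shape, so downstream agents can branch on it.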
[Figure 2 illustrates the workflow on the question "Can you recommend the latest TV series released on December 24, 2023, directed by Lilly Wachowski?": (1) Grounding — the grounding agent produces the instruction "First, use <Search TV> tool to search TV directed by Lilly Wachowski in Dec 24, 2023". (2) Execution — the execution agent selects <Search TV> from the toolset (which also contains image, math, and weather tools), generates arguments (e.g., {"date": ...}), and requests the tool server; a bad request triggers calibration. (3) Observation — the observing agent generates code for extracting the target values from the execution results; an exceptional error from the code interpreter triggers calibration.]

Figure 2: Our cooperative and interactive agent framework ConAgents. The grounding agent generates tool-use instructions, while the execution agent executes the selected tools. The observing agent extracts relevant values from the execution results. When receiving feedback from the tool environment, the execution agent and observing agent adapt themselves iteratively.
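The alternation of the three agents in Figure 2 can be sketched as a simple loop; the agent functions below are deterministic stubs standing in for the LLM-backed modules (M_G, M_E, M_O), not the paper's implementation:

```python
# Sketch of the ConAgents loop: grounding -> execution -> observing,
# repeated until the grounding agent emits a finish signal.
# All three "agents" are toy stubs; a real system would call an LLM.

def grounding_agent(task, trajectory):
    """M_G: ground the task into the next tool-use instruction."""
    if trajectory:  # one step suffices in this toy task
        return {"finish": True, "answer": trajectory[-1][1]}
    return {"finish": False,
            "instruction": "Use <search_tv> to find the series on 2023-12-24."}

def execution_agent(instruction):
    """M_E: complete arguments and call the tool (mocked here)."""
    return {"results": [{"title": "Example Show", "date": "2023-12-24"}]}

def observing_agent(instruction, result):
    """M_O: extract the relevant value r from the lengthy result R."""
    return result["results"][0]["title"]

def solve(task):
    trajectory = []  # accumulates (instruction, r) pairs, i.e. H^G
    while True:
        step = grounding_agent(task, trajectory)
        if step["finish"]:
            return step["answer"]
        result = execution_agent(step["instruction"])
        value = observing_agent(step["instruction"], result)
        trajectory.append((step["instruction"], value))

answer = solve("Which TV series was released on December 24, 2023?")
```

The trajectory passed back to the grounding agent mirrors the history H^G_i of Section 3.2: only the extracted values, never the raw lengthy results, re-enter the context.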

3.4 Observing Agent

The observing agent is employed to process the lengthy results. Most existing methods truncate or summarize (Qin et al., 2023c; Yao et al., 2023) the results to fit the context length, which potentially loses the information required to solve the task in subsequent steps. Recent work simplifies the lengthy results using a pre-created schema (Song et al., 2023), which is a documentation that elaborates on the examples, format, and possible errors. However, these methods are limited when dealing with unrecorded tools in general scenarios, such as newly emerged or updated tools. Therefore, we propose a schema-free method, enabling the observing agent to dynamically generate a function for extracting the target output following the instruction I_i. Specifically, we first construct a pseudo schema by recursively decomposing each element in R_i, representing the hierarchical structure of R_i and listing the type of each corresponding value as the schema S_i. The pseudo schema S_i instructs the observing agent M_O to generate code c_i, and an interpreter is utilized to extract the relevant value r_i:

    c_i = M_O(I_i, S_i),
    r_i = Interpret(c_i),    (2)

where Interpret(∗) indicates invoking the code to obtain the result. We provide an example and the pseudo algorithm for our schema construction in Appendix A.5.

When the code c_i is invoked successfully, the obtained value r_i is incorporated into the context of the grounding agent to generate the instruction I_{i+1} for the next iteration. Otherwise, r_i indicates a programming error, which instructs the observing agent to calibrate adaptively (Section 3.5).

3.5 Iterative Calibration

Previous studies solve a complex task in a pre-defined order. This mechanism tends to directly shift to the next iteration when the agents fail to complete their actions, limiting their ability to adaptively correct exceptional errors (Prasad et al., 2023). As shown in Figure 3, we propose an iterative calibration method (IterCali) to mitigate this problem. IterCali enables the execution agent and the observing agent to interact with tool servers and code interpreters, calibrating themselves following the error messages in the feedback.

Iterative calibration with tools. The agents suffer from calling a tool with wrong arguments due to the complex tool documentation and argument specifications (Zhuang et al., 2023; Gao et al., 2023b). Our proposed IterCali mitigates this problem by enabling the execution agent M_E to iteratively calibrate the misaligned arguments following the feedback from tool
[Figure 3 shows the two calibration prompts.

(a) Iterative calibration with tool servers:
"You are a helpful assistant, who can access external 'tools' and execute the instruction: {instruction}
Please complete the arguments and request data from servers based on the tool documentation: {documentation}
When receiving error messages, you should revise your output and try again.
{calibration history}"
The generated arguments and returned error messages are appended to the calibration history.

(b) Iterative calibration with code interpreter:
"Here is a Python `dict` called `response`. Please write code to extract values from the `response` following the instruction: {instruction}
The structure of `response` is simplified as below: {schema}.
Please revise your code when you receive the exceptional errors from a Python interpreter.
Your Output: ```[Python code]```
{calibration history}"
The code for extraction and the raised exceptions are appended to the calibration history.]

Figure 3: The prompts for our Iterative Calibration (IterCali) method, which instruct the execution agent to calibrate the generated arguments (a) and the observing agent to refine the generated code (b) with external feedback.

servers. Specifically, when a wrong argument a_i generated by the execution agent is used to request data from tool servers, the execution result R_i presents the error messages, showing a failure to execute the tool (see Section 3.3). We then iteratively append R_i to the calibration history of the execution agent M_E, explicitly prompting M_E to calibrate the generated arguments, which can be formulated as:

    a^j_i = M_E(D_i, I_i, {(a^{<j}_i, R^{<j}_i)}),    (3)

where D_i and I_i indicate the tool documentation and the instruction, and j indicates the iteration. The set {(a^{<j}_i, R^{<j}_i)} indicates the calibration history, containing the previously generated arguments a^{<j}_i and the responses R^{<j}_i from tool servers. This calibration is repeated until the tool is executed successfully or up to the maximum number of iterations α.

Iterative calibration with code interpreter. IterCali also enables the observing agent to calibrate the generated code. Specifically, when the code c_i fails to run, the interpreter raises an exception as r_i, e.g., a type error or an out-of-range exception. The error r_i is then used to prompt the observing agent M_O to adapt the generated content in an iterative manner, which is formulated as:

    c^j_i = M_O(I_i, S_i, {(c^{<j}_i, r^{<j}_i)}),    (4)

where S_i is the synthetic schema from Section 3.4, and j indicates the iteration. The set {(c^{<j}_i, r^{<j}_i)} indicates the calibration history, which consists of the previously generated code c^{<j}_i and the programming results r^{<j}_i from the code interpreter. This calibration is repeated until the code is invoked successfully or up to the maximum number of iterations β.

4 Experimental Setup

4.1 Datasets and Evaluation Metrics

Datasets. We conduct experiments on three datasets with tool learning tasks: (1) RestBench-TMDB (Song et al., 2023), a high-quality human-annotated dataset consisting of 54 tools for movie scenarios; (2) RestBench-Spotify (Song et al., 2023), a dataset with 40 tools for music scenarios; and (3) ToolBench (Qin et al., 2023c), a dataset containing diverse real-world tools across various applications.

Evaluation metrics. We use two evaluation metrics following Yang et al. (2023a) and Gao et al. (2023b): (1) Success Rate (Success%), measuring the proportion of successful query completions, and (2) Correct Path Rate (Path%), calculating the F1 score between the generated tool sequence and the ground-truth tool sequence. We also conduct a human evaluation, where three well-educated volunteers are invited to evaluate 30 randomly sampled cases with a three-scale rating in two aspects: (1) Executability (Exec): whether multiple tools are invoked in a correct logical order; and (2) Correct Rate of Parsing (Parsing): whether the agents can extract the relevant value from lengthy results and incorporate it to predict the next action. Since existing datasets lack the ground truth for extracting relevant values, we extend the TMDB and Spotify datasets by manually annotating the fine-granularity task-solving trajectory to provide the reference for our human evaluation. More details can be found in Appendix A.1.

4.2 Baselines

We mainly compare our method with agent-based tool learning methods, including: (1) DEPS (Wang et al., 2023d), an interactive planning approach empowered by LLMs to improve long-term planning ability; (2) Chameleon (Lu et al., 2023), an LLM-based agent that directly generates multi-step plans for tool use and then sequentially executes the plan; (3) ReAct (Yao et al., 2023), which prompts an LLM to generate chain-of-thought reasoning and actions in an interleaved manner; and (4) ToolLLM (DFSDT; Qin et al., 2023c), which enhances LLMs with a Depth First Search-based Decision Tree (DFSDT) to select tools to solve a task. Since our ConAgents synergizes three specialized agents, we also establish a ReAct@N baseline for further comparison, which runs ReAct multiple times until the input task is completed successfully or up to the maximum iteration number N. We set N from 1 to 5 in our experiments. We also consider a baseline with a multi-agent architecture, i.e., RestGPT (Song et al., 2023), which includes a coarse-to-fine planning module and a tool executor.

4.3 Implementation Details

We employ gpt-3.5-turbo2 from OpenAI as the LLM backbone for each agent in our method and all baselines. We instruct the three agents to perform specific actions with different system prompts, shown in Appendix A.4. The decoding temperature is set to 0 for the most deterministic generation. We also repeat the experiment with an open-source model, Mixtral-8x7B3, for further comparison. In our iterative calibration method (Section 3.5), we set the maximum numbers of iterations α = 3 and β = 3, respectively. The maximum step for DEPS and the maximum depth for ToolLLM (DFSDT) are set to 10. For each sample in the test set, we provide all the baselines with the same candidate toolset for a fair comparison, which contains the required tools and ten randomly sampled tools. We also add a special tool named Finish, which is used to indicate that the task is solved.

2 https://openai.com/chatgpt
3 https://huggingface.co/mistralai

Method                           RestBench-TMDB      RestBench-Spotify   ToolBench
                                 Success%   Path%    Success%   Path%    Success%   Path%
DEPS (Wang et al., 2023d)        38.00      43.00    19.30      36.40    -          -
ReAct (Yao et al., 2023)         40.00      71.19    51.28      60.35    39.39      65.04
Chameleon (Lu et al., 2023)      63.00      66.10    56.20      64.55    37.44      67.55
ToolLLM (Qin et al., 2023c)      68.00      76.77    61.69      74.77    66.39      86.43
² RestGPT (Song et al., 2023)    65.00      69.21    69.10      72.30    65.88      81.40
² ConAgents (ours)               77.00      77.97    73.68      76.10    70.88      82.40
ReAct@N (N = 2)                  54.00      67.90    56.71      59.47    41.41      63.67
ReAct@N (N = 3)                  62.00      65.40    58.13      63.26    42.67      66.12
ReAct@N (N = 4)                  66.00      64.34    60.13      65.48    44.58      65.35
ReAct@N (N = 5)                  67.00      65.41    61.21      64.46    47.47      67.70
Ablation study
² - w/o M_O                      61.00      65.57    64.23      64.37    67.55      78.37
² - w/o IterCali → Server        70.00      68.41    68.19      72.42    68.00      69.24
² - w/o IterCali → Code          68.00      64.24    67.11      73.27    65.33      70.14

Table 1: The results on three datasets. Path% indicates the Correct Path Rate metric. Methods marked ² use a multi-agent architecture; the others are single-agent. The bottom of the table shows the ablation study.

Method            TMDB                Spotify
                  Success%   Path%    Success%   Path%
Mixtral-8x7B
ReAct             26.00      61.21    21.35      47.21
ReAct@3           33.00      63.27    26.93      50.31
ToolLLM           37.00      64.32    28.07      52.31
² RestGPT         34.00      72.20    31.58      67.82
² ConAgents       41.00      70.31    35.09      64.37

Table 2: Results with Mixtral-8x7B as the backbone LLM for our method and the baselines.

Method            TMDB              Spotify
                  Exec   Parsing    Exec   Parsing
gpt-3.5-turbo
ReAct             1.89   1.93       1.77   2.10
ReAct@3           2.30   1.97       2.10   2.05
ToolLLM           2.26   1.87       2.26   2.30
² RestGPT         2.45   2.53       2.40   2.40
² ConAgents       2.76   2.60       2.60   2.40

Table 3: Human evaluation on Executability (Exec) and Correct Rate of Parsing (Parsing).

5 Results and Analysis

5.1 Experimental Results

Overall performance. Table 1 shows the experimental performance of all methods. We find
that our proposed ConAgents outperforms all the baselines on the three datasets in terms of all metrics. A reason is that our cooperative framework design enables each agent to perform specialized actions instead of grasping all required capabilities, thereby reducing the workload placed on a single agent. The significant improvement over the ReAct@N baselines further validates the effectiveness of our method. In addition, compared with RestGPT, ConAgents achieves a 12% improvement in Success Rate and an 8.76% absolute improvement in Correct Path Rate, respectively. This result suggests that our proposed IterCali method enables the agents to adaptively calibrate themselves when they fail to complete their tasks, improving the performance of the overall workflow.

Performance with the open-source LLM. We alternate the backbone LLM to the open-source model Mixtral-8x7B and repeat the experiment in the same setting for further comparison. As shown in Table 2, we find that Mixtral-8x7B can be directly adopted in ConAgents with better performance, e.g., pushing the Success Rate from 34.00 to 41.00 on the TMDB dataset. These results further prove the effectiveness of our cooperative framework.

5.2 Human Evaluation

Table 3 shows the results of the human evaluation. We find that ConAgents achieves the best performance in the Executability aspect, with a 0.20~0.31 absolute improvement over the strong baseline RestGPT. We also observe that our method achieves comparable and even better performance in Correct Rate of Parsing compared with RestGPT, which uses a pre-defined schema. This result illustrates that our observing agent can efficiently extract relevant values by programming with the synthetic schema. The Kappa statistics for Executability and Correct Rate of Parsing are 0.75 and 0.71 on the TMDB dataset, and 0.69 and 0.73 on the Spotify dataset, illustrating agreement among the annotators.

5.3 Ablation Study

To better understand the impact of the different components of our method, we apply the following modifications to the architecture.

- w/o M_O. We replace the observing agent in our cooperative agent framework with direct truncation. We observe a 9.60 average decrease in the Success Rate on the three datasets, e.g., dropping from 77.00 to 61.00 on the TMDB dataset. The same trend is also observed in the Correct Path Rate, e.g., a 9.45 absolute decrease on the Spotify dataset. These results indicate the necessity of the observing agent in solving complex tasks.

- w/o IterCali → Server. We remove the iterative calibration when executing the tools (Section 3.5). As shown in Table 1, the Success Rate shows a 5.27 average decline, while the Correct Path Rate shows a 5.08 average decline on the three datasets. These results indicate that the execution agent can refine the generated arguments following the error messages from tool servers, improving the performance of the overall framework.

- w/o IterCali → Code. We remove the iterative calibration when programming to extract the relevant values (Section 3.5). As shown in Table 1, both Success Rate and Correct Path Rate show a significant decrease, which indicates that the observing agent can revise programming errors in the generated code following the feedback from the interpreter and extract the relevant values.

5.4 Case Study

We conduct case studies and find that our proposed framework is more effective at executing various tools and incorporating the results to solve a complex task. We also provide concrete examples to intuitively explain the detailed calibration process. The details can be found in Appendix A.2.

            Ours    ReAct@1   ReAct@2   ReAct@3
∆ Path Len  1.919   1.714     4.261     5.851

Table 4: The ∆ Path Len for different methods on the TMDB dataset.

6 Discussions

Qualitative analysis for the maximum number of iterations. The IterCali method prompts the execution agent and observing agent to iteratively calibrate themselves based on feedback from the tool environment. To further explore the impact of the maximum iteration number on the performance of the overall framework, we conduct a qualitative analysis of the hyper-parameters α and β. Specifically, we vary α and β from 1 to 5, respectively. We then evaluate our method on the RestBench-TMDB dataset with the same setting as Table 1 and show the results for each iteration number in Figure 4. We observe
an increasing Success Rate when the maximum iteration number shifts from 1 to 3, which indicates that the agents can adapt their actions to complete the tasks. We also find a relatively stable trend when α and β keep increasing (from 3 to 5), which indicates that the agents can calibrate most of the errors within 3 iterations. We also examine the non-calibrated cases, where we find that when the grounding agent generates an inappropriate instruction, e.g., one lacking required arguments, the agents struggle to correct themselves with only the feedback from the environment.

[Figure 4 plots Success Rate and Correct Path Rate on TMDB as α varies from 1 to 5 with β = 3, and as β varies from 1 to 5 with α = 3, for our method with and without iteration.]

Figure 4: Qualitative analysis of the maximum iteration numbers α and β (Section 3.5) in our proposed IterCali method on the TMDB dataset.

[Figure 5 shows frequency histograms of token consumption, with average consumption µ = 12521.73 for ours, µ = 12065.80 for ReAct@1, µ = 14813.26 for ReAct@2, and µ = 21883.48 for ReAct@3.]

Figure 5: Efficiency analysis for different methods, where we count the distribution of consumed tokens and compute the average consumption µ.

Qualitative analysis for the efficiency of inference. Due to the intensive inference cost of LLM-based agents, we further explore the efficiency of our ConAgents. To explain more intuitively, we compare the token consumption of ConAgents and ReAct@N on the RestBench-TMDB dataset with the same setting as in Table 1, where N is set from 1 to 3. We show the frequency histogram of the number of consumed tokens for the different methods in Figure 5. We find that our framework spends fewer tokens than the strong baseline ReAct@3, with almost the same consumption as vanilla ReAct. The reason is that the modularized framework ConAgents enables each agent to perform specific tasks more efficiently, reducing the long exploration trajectory of a single agent.

We further compute the ∆ Path Len for each method, as in previous works (Song et al., 2023; Yang et al., 2023a), in Table 4, which evaluates the mean number of extra tools involved to solve a task. We find that our method solves a task using fewer steps than the strong baselines ReAct@N (N = 2 or 3). The decrease in ∆ Path Len further proves the effectiveness of our cooperative agent framework.

The runtime consistency of our method. Considering the inherently non-deterministic generation of LLMs, we further explore the consistency of our method. Specifically, we repeat our method with the same setting as Table 1. The statistical significance of the differences observed between the performance of the two runs is tested using a two-tailed paired t-test. We found no significant difference between the results of two randomly conducted experiments (significance level α = 0.05).

7 Conclusions

We present a cooperative and interactive agents framework (ConAgents) for tool learning tasks with two objectives: (1) enable the cooperation between agents to solve complex tasks; and (2) enable the agents to calibrate themselves when they fail their tasks. To achieve the first objective, ConAgents modularizes the workflow into a grounding agent, an execution agent, and an observing agent, and instructs the three agents to generate tool-use instructions, execute the tools, and extract the key values, respectively, solving a complex task via cooperation. For the second objective, ConAgents also enables the agents to interact with the tool environment, utilizing external feedback to adaptively calibrate themselves. Extensive experiments conducted on three benchmarks demonstrate the superiority of our ConAgents, e.g., pushing the success rate to 77.00 (a 13.2% relative improvement).
Limitations
The main limitation is that our LLM-based agents are limited when perceiving multi-modal tasks. When executing the tools, we represent images and speech with URLs, following previous works. In the future, we plan to extend our method to agents empowered by multi-modal foundation models.

Ethics Statement
This paper proposes a cooperative agent framework, synergizing modularized agents to solve complex tasks. The modularized design enables the agents to utilize feedback from the tool environment to calibrate themselves adaptively. All the tools used in our experiments are provided by open-source platforms, including TMDB, Spotify, and RapidAPI.
References

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2023. Large language models as tool makers. arXiv.

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2023. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology.

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. 2023. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv.

Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. 2023. Lm vs lm: Detecting factual errors via cross examination. arXiv.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv.

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jian, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D Hwang, et al. 2023. Faith and fate: Limits of transformers on compositionality. arXiv.

Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. 2023. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv.

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023a. PAL: Program-aided language models. In PMLR.

Shen Gao, Zhengliang Shi, Minghang Zhu, Bowen Fang, Xin Xin, Pengjie Ren, Zhumin Chen, and Jun Ma. 2023b. Confucius: Iterative tool learning from introspection feedback by easy-to-difficult curriculum. arXiv.

Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. arXiv.

Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, et al. 2023. Metatool benchmark for large language models: Deciding whether to use tools and which to use. arXiv.

Qiao Jin, Yifan Yang, Qingyu Chen, and Zhiyong Lu. 2023. Genegpt: Augmenting large language models with domain tools for improved access to biomedical information. arXiv.

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. arXiv.

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. Api-bank: A comprehensive benchmark for tool-augmented llms. In EMNLP.

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv.

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2023. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv.

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In ACL.

Amirkeivan Mohtashami, Florian Hartmann, Sian Gooding, Lukas Zilka, Matt Sharifi, et al. 2023. Social learning: Towards collaborative learning with large language models. arXiv.

Reiichiro Nakano, Jacob Hilton, S. Arun Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. Webgpt: Browser-assisted question-answering with human feedback. arXiv.

Bhargavi Paranjape, Scott M. Lundberg, Sameer Singh, Hanna Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. 2023. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv.

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv.

Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. 2023. Adapt: As-needed decomposition and planning with language models. arXiv.

Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023a. Communicative agents for software development. arXiv.

Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. 2023b. Experiential co-learning of software-developing agents. arXiv.

Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. 2024. Autoact: Automatic agent learning from scratch via self-planning. arXiv.

Yujia Qin, Zihan Cai, Dian Jin, Lan Yan, Shihao Liang, Kunlun Zhu, Yankai Lin, Xu Han, Ning Ding, Huadong Wang, Ruobing Xie, Fanchao Qi, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023a. WebCPM: Interactive web search for Chinese long-form question answering. In ACL.

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shi Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bo Li, Ziwei Tang, Jing Yi, Yu Zhu, Zhenning Dai, Lan Yan, Xin Cong, Ya-Ting Lu, Weilin Zhao, Yuxiang Huang, Jun-Han Yan, Xu Han, Xian Sun, Dahai Li, Jason Phang, Cheng Yang, Tongshuang Wu, Heng Ji, Zhiyuan Liu, and Maosong Sun. 2023b. Tool learning with foundation models. arXiv.

Yujia Qin, Shi Liang, Yining Ye, Kunlun Zhu, Lan Yan, Ya-Ting Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Marc H. Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023c. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv.

Yongliang Shen, Kaitao Song, Xu Tan, Dong Sheng Li, Weiming Lu, and Yue Ting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In PMLR.

Yifan Song, Weimin Xiong, Dawei Zhu, Chengzu Li, Ke Wang, Ye Tian, and Sujian Li. 2023. Restgpt: Connecting large language models with real-world applications via restful apis. arXiv.

Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, and Lingpeng Kong. 2023. Corex: Pushing the boundaries of complex reasoning through multi-model collaboration. arXiv.

Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv.

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. 2023. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv.

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, and Zhoujun Li. 2023a. Mac-sql: Multi-agent collaboration for text-to-sql. arXiv.

Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. 2023b. Mint: Evaluating llms in multi-turn interaction with tools and language feedback. arXiv.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023c. Self-instruct: Aligning language models with self-generated instructions. In ACL.

Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. 2023d. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv.

Chenfei Wu, Sheng-Kai Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv.

Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. 2024. A survey on knowledge distillation of large language models. arXiv.

Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. 2023a. Gpt4tools: Teaching large language model to use tools via self-instruction. arXiv.

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 2023b. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. In ICLR.

Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Qi Zhang, Tao Gui, et al. 2024a. Tooleyes: Fine-grained evaluation for tool learning capabilities of large language models in real-world scenarios. arXiv.

Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024b. Toolsword: Unveiling safety issues of large language models in tool learning across three stages. arXiv.

Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. 2023. Lumos: Learning agents with unified data, modular design, and open-source llms. arXiv.
Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun,
Julian McAuley, Wayne Xin Zhao, Leyu Lin, and
Ji-Rong Wen. 2023. Agentcf: Collaborative learning
with autonomous language agents for recommender
systems. arxiv.
Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun,
and Chao Zhang. 2023. Toolqa: A dataset for llm
question answering with external tools. arxiv.
A Appendix

A.1 Evaluation Metrics Details

Automatic evaluation. We mainly employ Success Rate and Correct Path Rate as two automatic evaluation metrics, following previous works (Yang et al., 2023a; Gao et al., 2023b). (1) Success Rate (Success%) computes the proportion of successful query completions. Specifically, when all the ground-truth tools are executed correctly, the success rate is set to 1, and otherwise to 0. (2) Correct Path Rate (Path%) computes the F1 score between the generated tool sequence and the ground-truth tool sequence.

Human evaluation. We conduct a human evaluation on two metrics: (1) Executability (Exec): whether the multiple tools are invoked in a correct logical order to complete the task; (2) Correct Rate of Parsing (Parsing): whether the agents can extract the relevant values from lengthy execution results and incorporate them to predict the next action. We invite three well-educated volunteers to evaluate 30 cases randomly sampled from the RestBench-TMDB and RestBench-Spotify datasets, respectively, with a three-scale rating. The Kappa statistics for Executability and Correct Rate of Parsing are 0.75 and 0.71 on the TMDB dataset, and 0.69 and 0.73 on the Spotify dataset, illustrating agreement among the annotators.

A.2 Case Study

We conduct several case studies and find that our method is effective at executing various tools and incorporating execution results to solve the input tasks.

Case for our ConAgents. Figure 6 presents a concrete example of the workflow of our proposed cooperative framework. Given the task "Give me a photo belonging to the second episode of the first season of The Witcher", the grounding agent generates a tool-use instruction that instructs the execution agent to execute the "search_tv" tool and instructs the observing agent to extract the "id" of the TV show. In the next turn, the extracted "id (71912)" is incorporated by the grounding agent to generate a new instruction, i.e., "use the season_episode_images to get the images". The new instruction is then processed by the execution agent and observing agent sequentially. Finally, the images are obtained, and the task is completed successfully. This example shows that our method can solve complex tasks effectively by synergizing the three agents.

Case for Comparison. We show a concrete example in Table 5 to compare our method with the baseline ReAct (Yao et al., 2023). We observe that ReAct fails to generate appropriate arguments to execute the tools, even after several iterations, and ends up failing to complete the task. In contrast, our ConAgents can utilize the required tools correctly and incorporate execution results to derive a correct answer successfully. This comparison provides intuitive evidence for the effectiveness of the proposed cooperative framework.

Examples for Iterative Calibration with Tool Server. Figure 7 shows an example of our proposed iterative calibration method with the tool server, where the execution agent calibrates the misaligned arguments following the response from the tool server, e.g., "the invalid id", and generates the code to extract the relevant value successfully.

Examples for Iterative Calibration with Code Interpreter. Figure 8 shows an example of our proposed iterative calibration method with a code interpreter, where the observing agent calibrates the programming errors following the feedback from the code interpreter, e.g., the TypeError, and generates the code to extract the relevant value successfully.

A.3 Datasets

Experiment dataset. We conduct experiments on three commonly used datasets with tool learning tasks, including:

• RestBench-TMDB (Song et al., 2023): a high-quality human-annotated dataset consisting of 54 tools about movie scenarios.

• RestBench-Spotify (Song et al., 2023): a dataset with 40 tools for music scenarios.

• ToolBench (Qin et al., 2023c): a dataset containing diverse real-world tools across various applications, which contains simple tasks, i.e., solving a task with one single tool, and complex tasks, i.e., executing multiple tools in a logical order to solve a task.
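To make the two automatic metrics from Appendix A.1 concrete, here is a minimal sketch (our own illustration, not the authors' implementation; tool sequences are assumed to be lists of tool-name strings):

```python
from collections import Counter

def success_rate(per_query_success: list) -> float:
    """Success%: proportion of queries where all ground-truth tools executed correctly."""
    return sum(per_query_success) / len(per_query_success)

def path_f1(predicted: list, gold: list) -> float:
    """Path%: F1 score between the generated and ground-truth tool sequences."""
    matched = sum((Counter(predicted) & Counter(gold)).values())  # multiset overlap
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For instance, a predicted path ["search_person", "search_movie"] against the gold path ["search_person", "movie_movieID_credits"] has precision 0.5 and recall 0.5, giving a Path% of 0.5.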
Dataset: RestBench-TMDB
Query: Who is the director of Leonardo DiCaprio’s latest movie?

Candidate Tool List:
search_person: get the person id
search_tv: Search for a TV show.
search_movie: Search for movies.
person_personID_movie_credits: Get the movie credits for a person; the results contain various information such as popularity and release date.
tv_tvID_credits: Get the credits (cast and crew) that have been added to a TV show.
movie_top_rated: Get the top rated movies on TMDb.
tv_tvID_reviews: Get the user reviews for a TV show.
search_collection: Get collection details by id.
collection_collectionID_images: Get the images for a collection by id.
movie_movieID_credits: Get the cast and crew for a movie.

Model Output

ConAgents:
grounding agent: > > First, use <search_person> to search for the person id of Leonardo DiCaprio.
execution agent: > > {"url": "https://api.themoviedb.org/3/search/person", "params": {"query": "Leonardo DiCaprio"}, "description": "Search for the person id of Leonardo DiCaprio."}
observing agent: > > The id of the person is 6193
grounding agent: > > Second, using the <person_personID_movie_credits> to get the details of the movies Leonardo DiCaprio (id 6193) join in
execution agent: > > {"url": "https://api.themoviedb.org/3/person/6193/movie_credits", "description": "The movie credit list of Leonardo DiCaprio."}
observing agent: > > the latest movie is Untitled Barry Jenkins ’Virunga’ Adaptation (718375).
grounding agent: > > Third, use <movie_movieID_credits> to get the director of the movie 718375.
execution agent: > > {"url": "https://api.themoviedb.org/3/movie/718375/credits", "description": "The director of the movie Untitled Barry Jenkins ’Virunga’ Adaptation."}
observing agent: > > the director of this movie (id 718375) is Barry Jenkins.
grounding agent: > > the director of Leonardo DiCaprio’s latest movie is Barry Jenkins.

ReAct:
Turn 1-1: > > <search_person>
Turn 1-2: > > {"query": "Leonardo DiCaprio", "page": 1, "include_adult": "false", "region": "US"}
Turn 1-3: > > {"page":1,"results":{"adult":false,"gender":2,"id":6193,"known_for_department":"Acting"...}
Turn 2-1: > > person_personID_movie_credits
Turn 2-2: > > {"person_id": "6193"}
Turn 2-3: > > {"cast":{"adult":false,"backdrop_path":"/m0miVRUF6vmCos.jpg","id":454,"ori... }
Turn 3-1: > > movie_movieID_credits
Turn 3-2: > > {"movie_id": "791373"}
Turn 3-3: > > {"id":791373,"cast":{"adult":false,"gender":2,"id":880,"department":"Acting","name"...
Turn 4-1: > > movie_movieID_credits
Turn 4-2: > > {"movie_id": "791373"}
Turn 4-3: > > {"id":791373,"cast":{"adult":false,"gender":2,"id":880,"department":"Acting","name"...
Turn 5-1: > > movie_movieID_credits
Turn 5-2: > > {"movie_id": "791373"}
Turn 5-3: > > {"id":791373,"cast":{"adult":false,"gender":2,"id":880,"department":"Acting","name":"...
Turn 6-1: > > The movie with ID 791373 does not have the director information.

Table 5: An example of model outputs on the RestBench-TMDB dataset, where we highlight the critical task-solving steps, i.e., selecting a tool, executing the tool, and incorporating the execution results for the next step. We also highlight the right and wrong values extracted by the agents from the execution results in the overall workflow. This example shows that our method is more effective at solving a complex task. We represent the same execution result with the same placeholder (e.g., %s) for a more intuitive explanation.
[Figure 6 depicts the ConAgents trajectory for the query "Can you give me a photo belonging to the second episode of the first season of The Witcher?": in iteration 1, the grounding agent instructs a search_tv call and the observing agent extracts the TV show id (71912); in iterations 2-3, the season_episode_images tool is called and the image file paths are extracted, completing the task.]
Figure 6: An example to illustrate the workflow of our proposed cooperative framework ConAgents, which synergizes the grounding agent, execution agent, and observing agent for several turns until the task is completed. In each turn, the grounding agent generates a tool-use instruction, while the execution agent and observing agent are employed to execute the selected tool and extract relevant values from the lengthy execution results, respectively. CoAgents is short for ConAgents.

We conduct experiments on the full datasets of TMDB and Spotify. Due to the intensive inference cost of LLM-based agents, we randomly sample 100 cases as test sets from the complex tasks in the ToolBench dataset to evaluate the performance of our cooperative agent framework in solving practical tasks.

Extend existing datasets. Existing datasets only annotate the required tools to solve the tasks, lacking what arguments to pass, the execution results, and how to incorporate the results step-by-step. We first extend the RestBench-TMDB and RestBench-Spotify datasets with a fine-granularity task-solving trajectory to support our human evaluation to evaluate the Correct Rate of Parsing. We select these two datasets for annotation because they are open-source, free of charge, and high-quality. Specifically, we invite four annotators to manually solve a task step-by-step using the ground-truth tools provided by the existing datasets. To guarantee annotation quality, we ask at least two annotators to solve the tasks in the dataset repeatedly. If there is a discrepancy between the two annotators (i.e., the two annotators give different solutions), we ask a third annotator to recheck it. Finally, we extend the full datasets of TMDB and Spotify, with each case containing the required tool names, the corresponding arguments, the execution results, and the final answer. We randomly sample 30 cases
from the two extended datasets, respectively, which are used for our human evaluation. We also release the extended datasets to facilitate future research.

A.4 System Prompts for Agents

We provide the system prompts to enable the agents in our ConAgents to perform specialized actions. The system prompt for the grounding agent is shown in Table 6. The system prompt for the execution agent is shown in Table 7. The system prompt for the observing agent is shown in Table 8.

A.5 Synthetic Schema with Recursive Decomposition

We propose a recursive decomposition algorithm to synthesize a schema automatically. Our algorithm constructs the schema by recursively decomposing each element in the lengthy execution results, representing the hierarchical structure among elements with the depth of recursion, and listing the type of the corresponding value as the schema. Pseudocode for the overall procedure is given in Algorithm 1. To explain more intuitively, we also illustrate our algorithm with an example in Figure 10, which demonstrates how to construct the schema (right) based on lengthy execution results (left).

Algorithm 1: Construct the schema of the execution result with recursive decomposition.
Data: root, indent (default = 0)
Result: The synthetic schema, represented as a list of tuples, i.e., (name, type, depth).
// depth indicates the hierarchical structure for each element in a complex variable
// IsBasic → True/False
// Iterate → Each element
1  Function RecurDecompose(root, depth):
2      name ← root.name
3      type ← root.type
4      tree ← List[(name, type, depth)]
5      if IsBasic(root) then
6          return tree
7      foreach child in Iterate(root) do
8          tmp ← RecurDecompose(child, depth + 1)
9          tree.append(tmp)
10     return tree  // get the tree

The synthetic schema, paired with the execution results, is then fed into the observing agent, prompting the observing agent to generate the code used to extract relevant values. This method can be used to deal with unrecorded tools in more general scenarios.

[Figure 7 depicts an example for the query "Use GET_movie_id_similar to get a list of similar movies to Titanic (movie id is 597)": the execution agent's first request with the unresolved path "movie/{movie_id}/similar" returns the error message "Invalid id: The pre-requisite id is invalid or not found."; the agent then reissues the request with the concrete path "movie/597/similar" and successfully obtains the data.]

Figure 7: An example to illustrate the workflow of our iterative calibration with tool servers, where the execution agent calibrates the misaligned arguments following the error messages from the tool servers and executes the tools successfully to request the data.

A.6 Efficiency Analysis

We further explore the efficiency of our proposed ConAgents, comparing the token consumption of ConAgents and the strong baselines ReAct@N on the RestBench-TMDB dataset with the same setting as in Table 1. We set N from 1 to 5 for a comprehensive comparison. We show the frequency histogram of the number of consumed tokens for the different methods in Figure 9. We find that our framework spends fewer tokens than the strong baselines ReAct@N (N ≥ 2), with almost the same token consumption as vanilla ReAct. The potential reason is that the modularized
Prompt to enable the grounding agent
System prompt
You are an agent that accesses external Tools and utilizes them to solve user queries step by step.

Please interleave three steps: Thought, Tool Selection and Execution Result. Here is the format:
User Query: the query of users
Thought: "Specify which tool to use and what value you want to obtain through this tool."
Tool Selection: "Just select a tool from the list. DO NOT change the name of the tool."
Execution Result: "Specify what you want to obtain from the execution results."
...(The Thought, Tool Selection, and Execution Result can be repeated N times)
Final Answer: Your final answer to this task

Note: (1) Every step of Thought should remain atomic, e.g., keeping it as simple as possible and including only one single tool.
(2) DO NOT use pronouns in the plan but use the corresponding results obtained previously.

Here is a specific example:

{example}
Input task
Only using the following API to solve the user query.

==API List==
{api_list}

User query: {query} {hidden}

Table 6: The prompt to enable the grounding agent in our cooperative framework. We detail the system prompt and
input prompt, respectively.

Prompt to enable the execution agent


System prompt
You are a helpful assistant. You can access external tools to execute my instruction.

Input prompt
You can access external tools to execute the instruction:
{instruction}

Please complete the arguments and request data from servers based on the tool documentation:
{documentation}

When receiving error messages, you should revise your output and try again.

Here are some examples:


{examples}

Table 7: The prompt to enable the execution agent in our cooperative framework. We detail the system prompt and
input prompt, respectively.

Prompt to enable the observing agent

System prompt
Here is a python 'dict' called 'response'. Please write code to extract values from the 'response' following my instruction.

Input prompt
Here is a python 'dict' called 'response'. Please write code to extract values from the 'response' following the instruction:
{instruction}

The structure of 'response' is simplified as below:

{schema}

You should revise your code when you receive an exception error from the Python interpreter.

Your Output:
```python
[Python code]
```

Table 8: The prompt to enable the observing agent in our cooperative framework. We detail the system prompt and input prompt, respectively.
[Figure 8 depicts an example for the instruction "Based on the person id (153), use person_movie_credits to get the movies directed by this person": across successive attempts, the observing agent's extraction code raises KeyError: 'name' and TypeError: sequence item 0: expected str instance, NoneType found; the final revision reads movie['title'] instead and successfully prints the movie titles and ids, e.g., "The Sound of 007 1028506".]
Figure 8: An example to illustrate the workflow of our iterative calibration with the code interpreter, where the observing agent revises the programming error following the exception raised by the interpreter and generates correct code to extract the relevant values from lengthy content.

Method:      Ours   ReAct@1 ReAct@2 ReAct@3 ReAct@4 ReAct@5
∆ Path Len:  1.919  1.714   4.261   5.851   6.132   6.545

Table 9: The ∆ Path Len for different methods on the TMDB dataset.

framework ConAgents enables each agent to perform specific tasks more efficiently, reducing the long exploration trajectory encountered by a single agent.

In Table 9, we further compute the ∆ Path Len for each method, following previous works (Song et al., 2023; Yang et al., 2023a). The ∆ Path Len metric evaluates the average number of extra tools involved in solving a task. We find that our method solves a task using fewer steps than the strong baselines ReAct@N (N ≥ 2). The decrease in ∆ Path Len further proves the effectiveness of our cooperative agent framework.
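Under our reading, ∆ Path Len can be sketched as the mean number of tool calls beyond the length of the ground-truth path. The following is a hypothetical formulation for illustration; the original metric from Song et al. (2023) may differ in detail:

```python
def delta_path_len(predicted_paths, gold_paths):
    """Mean number of extra tool calls beyond the ground-truth path length."""
    extras = [len(pred) - len(gold)
              for pred, gold in zip(predicted_paths, gold_paths)]
    return sum(extras) / len(extras)
```

For example, predicted paths of lengths 4 and 3 against ground-truth paths of length 2 each give extras of 2 and 1, i.e., a ∆ Path Len of 1.5.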
[Figure 9 panels report the average token consumption µ per method: Ours 12521.73; ReAct@1 12065.80; ReAct@2 14813.26; ReAct@3 21883.48; ReAct@4 22168.64; ReAct@5 23533.13.]

Figure 9: The efficiency analysis for our method and the ReAct@N baselines (1 ≤ N ≤ 5), where we count the distribution of tokens consumed per case and compute the average consumption µ.
(a) The lengthy execution results: the raw JSON response for a Spotify album query (the album "Global Warming" by Pitbull, id "4aawyAB9vmqN3uQ7FjRGTy"), spanning deeply nested fields such as artists, copyrights, images, and tracks (content omitted here).

(b) The pseudo schema:

- response: dict
  - album_type: str
  - artists: list
    - artists[0]: dict
      - name: str
      - type: str
      - uri: str
  - copyrights: list
    - copyrights[0]: dict
      - text: str
      - type: str
  - id: str
  - images: list
    - images[0]: dict
      - height: int
      - url: str
      - width: int
  - is_playable: bool
  - label: str
  - name: str
  - popularity: int
  - release_date: str
  - release_date_precision: str
  - total_tracks: int
  - tracks: dict
    - href: str
    - items: list
      - items[0]: dict
        - artists: list
          - artists[0]: dict
            - external_urls: dict
              - spotify: str
            - href: str
            - id: str
            - name: str
            - type: str
            - uri: str
        - disc_number: int
        - explicit: bool
        - external_urls: dict
          - spotify: str
        - href: str
        - id: str
        - restrictions: dict
          - reason: str
    - limit: int
    - next: NoneType
    - offset: int
    - previous: NoneType
    - total: int
  - type: str
  - uri: str

Figure 10: A concrete example demonstrating how to construct a schema for the tool response, where we recursively decompose each element in the execution result, represent the hierarchical structure via indentation, and list the type of the corresponding values.
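The construction in Figure 10 can be approximated by recursively walking the JSON-like value and recording only field names and types, descending into the first element of each list. A minimal sketch (the function name is ours, not from the paper):

```python
def build_schema(value, name="response", depth=0):
    """Recursively turn a JSON-like value into an indented pseudo schema,
    keeping field names and types but dropping the (lengthy) values."""
    lines = [f"{'  ' * depth}- {name}: {type(value).__name__}"]
    if isinstance(value, dict):
        for key, child in value.items():
            lines += build_schema(child, key, depth + 1)
    elif isinstance(value, list) and value:
        # one representative element is enough to describe the structure
        lines += build_schema(value[0], f"{name}[0]", depth + 1)
    return lines

# On a fragment of the album response:
print("\n".join(build_schema({"album_type": "album",
                              "artists": [{"name": "Pitbull"}]})))
```

This prints the same style of tree as panel (b): `- response: dict`, then `- album_type: str`, `- artists: list`, `- artists[0]: dict`, and `- name: str` at increasing indentation.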
B Reviewer PHS5 (S: 3, O: 3)

Weakness 1: The approach of dividing an agent into a planning agent, a tool-invocation agent, plus additional agents is not novel; there are several similar works:

[1] Song Y, Xiong W, Zhu D, et al. RestGPT: Connecting large language models with real-world applications via RESTful APIs. arXiv preprint arXiv:2306.06624, 2023.

[2] Shen W, Li C, Chen H, et al. Small LLMs are weak tool learners: A multi-LLM agent. arXiv preprint arXiv:2401.07324, 2024.

[3] Qian C, Dang Y, Li J, et al. Experiential co-learning of software-developing agents. arXiv preprint arXiv:2312.17025, 2023.

Response:

Weakness 2: The issue of "the limitation of the inherent capability of a single LLM to perform diverse actions" proposed in this paper has not been well analyzed. The paper merely demonstrates that their proposed framework outperforms other frameworks. However, since the underlying LLM for all agents within this framework remains the same single LLM, the experimental results do not actually illustrate the limitations of a single LLM's capabilities. A possible method to verify this may be using a single-agent framework, where the system prompt of this agent simultaneously describes the functions of the grounding agent, execution agent, and observing agent, and then sequentially executing this single agent as the process introduced in this paper.

Response:

C Reviewer djSr (S: 3.5, O: 3)

Weakness 1: The authors don't discuss much of the characteristics of the datasets and their complexity.

Response:

Weakness 2: The human evaluation is small in scale (3 annotators only).

Response:

D Reviewer YkaU (S: 2.5, O: 2.5)

Weakness 1: The core concept of modularizing the tool learning process into specialized agents shares similarities with prior works such as RestGPT. The overall architecture design is also quite similar to RestGPT. The introduction of the Observing agent for processing lengthy tool responses and the IterCali for error correction does present some advancement; however, the overarching architecture's resemblance to existing models raises questions about the framework's novelty.

Response:

Weakness 2: The IterCali, although crucial for error correction and adaptability, bears similarity to ReAct's feedback-based action loop. The concept itself is not that new. Additionally, the revise-based-on-feedback mechanism has been discussed in many previous works (e.g., Reflexion and Generative Agents). I tend not to call it novel around 1 year after the release of Reflexion and Generative Agents. It is more like a common trick.

Response:

Weakness 3: The choice of baselines in the analysis experiments only includes ReAct, which, while relevant, restricts the comparative analysis's breadth. The absence of a direct comparison with RestGPT and ToolLLM, especially considering the conceptual similarities with RestGPT, might lead to an incomplete evaluation of the proposed framework's performance and efficiency.

Response:

Weakness 4: Although stating that the observer is "schema-free", it heavily relies on the assumption that the tool responses are in JSON or other similar structured formats. Compared to not using an observer, the generalizability is decreased while the performance is improved.

Response:

Overall Weakness: I acknowledge that this paper is "solid" in the sense of the comprehensive experimental work presented. However, the architectural resemblance between ConAgents and RestGPT raises concerns about the novelty and significance of the proposed modifications. The incremental modification of the architecture seems trivial to me. I understand that the research in agent-based systems is fast-paced, but introducing some incremental changes to an architecture that was established more than half a year ago and not showing really significant improvement is not so appealing to me. In light of these considerations, I'm inclined to a weak rejection.

Response:

Comments: In Table 1, ToolLLM is the best at the Path% of ToolBench, but it is not bolded. And in the Table 1 ablation study, it should be M_O instead of M_E?

Response:
