Learning To Use Tools Via Cooperative and Interactive Agents
Zhengliang Shi1 Shen Gao1 Xiuyi Chen2 Lingyong Yan2 Haibo Shi2
Dawei Yin2 Zhumin Chen1 Pengjie Ren1 Suzan Verberne3 Zhaochun Ren3∗
1 Shandong University, Qingdao, China   2 Baidu Inc., Beijing, China   3 Leiden University, Leiden, The Netherlands
shizhl@mail.sdu.edu.cn, z.ren@liacs.leidenuniv.nl
Abstract

[...] their capability. Existing methods employ one single LLM-based agent to iteratively select and execute tools, thereafter incorporating the result into the next action prediction. However, they still suffer from potential performance degradation when addressing [...]

[Figure: an example question, "Can you tell me what is the latest movie in December 24, 2023 directed by Lilly Wachowski?", is decomposed by the grounding agent (e.g., "First, use <Search TV> tool to search TV directed by Lilly Wachowski in Dec 24, 2023"), executed against the tool server by the execution agent, and parsed by the observing agent via a code interpreter, with calibration upon execution failures and exception errors.]

Figure 2: Our cooperative and interactive agent framework ConAgents. The grounding agent generates tool-use instructions while the execution agent executes the selected tools. The observing agent extracts relevant values from execution results. When receiving feedback from the tool environment, the execution agent and observing agent adapt themselves iteratively.
Table 1: The results on three datasets. Path% indicates the Correct Path Rate metric. Distinct icons mark the single-agent methods and the multi-agent architectures. The bottom of the table shows the ablation study.
Table 3 shows the results of the human evaluation. We find that ConAgents achieves the best performance in the Executability aspect, with a 0.20~0.31 absolute improvement over the strong baseline RestGPT. We also observe that our method achieves comparable and even better performance in Correct Rate of Parsing compared with RestGPT, which uses a pre-defined schema. This result illustrates that our observing agent can efficiently extract relevant values by programming with the synthetic schema. The Kappa statistics for Executability and Correct Rate of Parsing are 0.75 and 0.71 on the TMDB dataset and 0.69 and 0.73 on the Spotify dataset, illustrating agreement among the annotators.

5.3 Ablation Study

To better understand the impact of different components of our method, we apply the following modifications to the architecture.

- w/o M_O. We replace the observing agent in our cooperative agent framework with direct truncation. We observe a 9.60 average decrease in the Success Rate on the three datasets, e.g., dropping [...]

5.4 Case Study

We conduct case studies and find that our proposed framework is more effective at executing various tools and incorporating the results to solve a complex task. We also provide concrete examples to intuitively explain the detailed calibration process. The details can be found in Appendix A.2.

6 Discussions

Qualitative analysis for the maximum number of iterations. The IterCali method prompts the execution agent and observing agent to iteratively calibrate themselves based on feedback from the tool environment. To further explore the impact of the maximum iteration number on the performance of the overall framework, we conduct a qualitative analysis for the hyper-parameters α and β. Specifically, we vary α and β from 1 to 5 in turn. We then evaluate our method on the RestBench-TMDB dataset with the same setting as Table 1 and show the results for each iteration number in Figure 4. We observe [...]
[Figure: four panels plotting Success Rate and Correct Path Rate against the iteration number (1 to 5), for ours and ours w/o iteration.]

Figure 4: The qualitative analysis for the maximum iteration numbers α (Section 3.5) and β (Section 3.5) in our proposed IterCali method using the TMDB dataset.

[Figure: frequency histograms of the number of tokens consumed per case by our method and by ReAct@N, each annotated with its average consumption µ.]

Figure 5: The efficiency analysis for different methods, where we count the distribution of consumed tokens and compute the average consumption µ.
Ethics Statement
The paper proposes a cooperative agent framework, synergizing modularized agents to solve complex tasks. The modularized design enables the agents to utilize feedback from the tool environment to calibrate themselves adaptively. All the tools used in our experiments are provided by open-source platforms, including TMDB, Spotify, and RapidAPI.
References

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2023. Large language models as tool makers. arXiv preprint.

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2023. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology.

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. 2023. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint.

Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. 2023. LM vs LM: Detecting factual errors via cross examination. arXiv preprint.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint.

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jian, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, et al. 2023. Faith and fate: Limits of transformers on compositionality. arXiv preprint.

Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. 2023. Improving language model negotiation with self-play and in-context learning from AI feedback. arXiv preprint.

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023a. PAL: Program-aided language models. In PMLR.

Shen Gao, Zhengliang Shi, Minghang Zhu, Bowen Fang, Xin Xin, Pengjie Ren, Zhumin Chen, and Jun Ma. 2023b. Confucius: Iterative tool learning from introspection feedback by easy-to-difficult curriculum. arXiv preprint.

Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. arXiv preprint.

Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, et al. 2023. MetaTool benchmark for large language models: Deciding whether to use tools and which to use. arXiv preprint.

Qiao Jin, Yifan Yang, Qingyu Chen, and Zhiyong Lu. 2023. GeneGPT: Augmenting large language models with domain tools for improved access to biomedical information. arXiv preprint.

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. arXiv preprint.

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. API-Bank: A comprehensive benchmark for tool-augmented LLMs. In EMNLP.

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint.

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2023. Dynamic LLM-agent network: An LLM-agent collaboration framework with agent team optimization. arXiv preprint.

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In ACL.

Amirkeivan Mohtashami, Florian Hartmann, Sian Gooding, Lukas Zilka, Matt Sharifi, et al. 2023. Social learning: Towards collaborative learning with large language models. arXiv preprint.

Reiichiro Nakano, Jacob Hilton, S. Arun Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint.

Bhargavi Paranjape, Scott M. Lundberg, Sameer Singh, Hanna Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. 2023. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint.

Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. 2023. ADaPT: As-needed decomposition and planning with language models. arXiv preprint.

Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023a. Communicative agents for software development. arXiv preprint.

Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. 2023b. Experiential co-learning of software-developing agents. arXiv preprint.

Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. 2024. AutoAct: Automatic agent learning from scratch via self-planning. arXiv preprint.

Yujia Qin, Zihan Cai, Dian Jin, Lan Yan, Shihao Liang, Kunlun Zhu, Yankai Lin, Xu Han, Ning Ding, Huadong Wang, Ruobing Xie, Fanchao Qi, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023a. WebCPM: Interactive web search for Chinese long-form question answering. In ACL.

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shi Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bo Li, Ziwei Tang, Jing Yi, Yu Zhu, Zhenning Dai, Lan Yan, Xin Cong, Ya-Ting Lu, Weilin Zhao, Yuxiang Huang, Jun-Han Yan, Xu Han, Xian Sun, Dahai Li, Jason Phang, Cheng Yang, Tongshuang Wu, Heng Ji, Zhiyuan Liu, and Maosong Sun. 2023b. Tool learning with foundation models. arXiv preprint.

Yujia Qin, Shi Liang, Yining Ye, Kunlun Zhu, Lan Yan, Ya-Ting Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Marc H. Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023c. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint.

Yongliang Shen, Kaitao Song, Xu Tan, Dong Sheng Li, Weiming Lu, and Yue Ting Zhuang. 2023. HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace. arXiv preprint.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In PMLR.

Yifan Song, Weimin Xiong, Dawei Zhu, Chengzu Li, Ke Wang, Ye Tian, and Sujian Li. 2023. RestGPT: Connecting large language models with real-world applications via RESTful APIs. arXiv preprint.

Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, and Lingpeng Kong. 2023. Corex: Pushing the boundaries of complex reasoning through multi-model collaboration. arXiv preprint.

Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-agent collaboration: Harnessing the power of intelligent LLM agents. arXiv preprint.

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. 2023. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint.

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, and Zhoujun Li. 2023a. MAC-SQL: Multi-agent collaboration for text-to-SQL. arXiv preprint.

Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. 2023b. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. arXiv preprint.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023c. Self-Instruct: Aligning language models with self-generated instructions. In ACL.

Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. 2023d. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint.

Chenfei Wu, Sheng-Kai Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint.

Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. 2024. A survey on knowledge distillation of large language models. arXiv preprint.

Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. 2023a. GPT4Tools: Teaching large language model to use tools via self-instruction. arXiv preprint.

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 2023b. MM-ReAct: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In ICLR.

Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Qi Zhang, Tao Gui, et al. 2024a. ToolEyes: Fine-grained evaluation for tool learning capabilities of large language models in real-world scenarios. arXiv preprint.

Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024b. ToolSword: Unveiling safety issues of large language models in tool learning across three stages. arXiv preprint.

Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. 2023. Lumos: Learning agents with unified data, modular design, and open-source LLMs. arXiv preprint.

Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun, Julian McAuley, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023. AgentCF: Collaborative learning with autonomous language agents for recommender systems. arXiv preprint.

Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2023. ToolQA: A dataset for LLM question answering with external tools. arXiv preprint.
A Appendix

A.1 Evaluation Metrics Details

Automatic evaluation. We mainly employ Success Rate and Correct Path Rate as the two automatic evaluation metrics, following previous works (Yang et al., 2023a; Gao et al., 2023b). (1) Success Rate (Success%) computes the proportion of successful query completions. Specifically, when all the ground-truth tools are executed correctly, the success rate is set to 1, and to 0 otherwise. (2) Correct Path Rate (Path%) computes the F1 score between the generated tool sequence and the ground-truth tool sequence.
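As an illustration, the following is a minimal Python sketch of how these two metrics can be computed from predicted and gold tool sequences. This is our own simplified reading of the definitions above, not the released evaluation code, which may differ, e.g., in how duplicated tools are counted:

```python
def success_rate(case_solved: list[bool]) -> float:
    """Success%: fraction of queries whose ground-truth tools
    were all executed correctly (1 per solved query, else 0)."""
    return sum(case_solved) / len(case_solved)

def correct_path_rate(pred_tools: list[str], gold_tools: list[str]) -> float:
    """Path%: F1 score between the generated tool sequence and the
    ground-truth tool sequence (computed here over tool sets)."""
    pred, gold = set(pred_tools), set(gold_tools)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: two of the three gold tools are predicted, plus one extra tool.
print(correct_path_rate(
    ["search_person", "movie_top_rated", "person_personID_movie_credits"],
    ["search_person", "person_personID_movie_credits", "movie_movieID_credits"],
))  # -> 0.666...
```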
Human evaluation. We conduct a human evaluation on two metrics, including: (1) Executability (Exec): whether the multiple tools are invoked in a correct logical order to complete the task; (2) Correct Rate of Parsing (Parsing): whether the agents can extract the relevant values from lengthy execution results and incorporate them to predict the next action. We invite three well-educated volunteers to evaluate 30 cases randomly sampled from the RestBench-TMDB and RestBench-Spotify datasets, respectively, with a three-scale rating. The Kappa statistics for Executability and Correct Rate of Parsing are 0.75 and 0.71 on the TMDB dataset and 0.69 and 0.73 on the Spotify dataset, illustrating agreement among the annotators.

A.2 Case Study

We conduct several case studies and find that our method is effective at executing various tools and incorporating execution results to solve the input tasks.

Case for our ConAgents. Figure 6 presents a concrete example of the workflow of our proposed cooperative framework. Given the task "Give me a photo belonging to the second episode of the first season of The Witcher", the grounding agent generates a tool-use instruction which instructs the execution agent to execute the "search_tv" tool and instructs the observing agent to extract the "id" of the TV show. In the next turn, the extracted id (71912) is incorporated by the grounding agent to generate a new instruction, i.e., "use the season_episode_images to get the images". The new instruction is then processed by the execution agent and observing agent sequentially. Finally, the images are obtained, and the task is completed successfully. This example shows that our method can solve complex tasks effectively by synergizing the three agents.

Case for Comparison. We show a concrete example in Table 5 to compare our method with the baseline ReAct (Yao et al., 2023). We observe that ReAct fails to generate appropriate arguments to execute the tools, even after iterating several times, and ends up failing to complete the task. In contrast, our ConAgents can utilize the required tools correctly and incorporate the execution results to derive a correct answer successfully. This comparison provides intuitive evidence for the effectiveness of the proposed cooperative framework.

Examples for Iterative Calibration with Tool Server. Figure 7 shows an example of our proposed Iterative Calibration method with the tool server, where the execution agent calibrates the misaligned arguments following the response from the tool servers, e.g., "the invalid id", and then executes the tool successfully.

Examples for Iterative Calibration with Code Interpreter. Figure 8 shows an example of our proposed Iterative Calibration method with a code interpreter, where the observing agent calibrates the programming errors following the feedback from the code interpreter, e.g., the TypeError, and generates the code to extract the relevant value successfully.

A.3 Datasets

Experiment datasets. We conduct experiments on three commonly used datasets with tool learning tasks, including:

• RestBench-TMDB (Song et al., 2023): a high-quality human-annotated dataset consisting of 54 tools for movie scenarios.

• RestBench-Spotify (Song et al., 2023): a dataset with 40 tools for music scenarios.

• ToolBench (Qin et al., 2023c): a dataset containing diverse real-world tools across various applications, which contains simple tasks, i.e., solving a task with one single tool, and complex tasks, i.e., executing multiple tools in a logical order to solve a task.
dataset: RestBench-TMDB
Query: Who is the director of Leonardo DiCaprio's latest movie?
Candidate Tool List:
- search_person: get the person id
- search_tv: Search for a TV show.
- search_movie: Search for movies.
- person_personID_movie_credits: Get the movie credits for a person; the results contain various information such as popularity and release date.
- tv_tvID_credits: Get the credits (cast and crew) that have been added to a TV show.
- movie_top_rated: Get the top rated movies on TMDb.
- tv_tvID_reviews: Get the user reviews for a TV show.
- search_collection: Get collection details by id.
- collection_collectionID_images: Get the images for a collection by id.
- movie_movieID_credits: Get the cast and crew for a movie.

Model Output

ConAgents:
grounding agent: >> First, use <search_person> to search for the person id of Leonardo DiCaprio.
execution agent: >> {"url": "https://api.themoviedb.org/3/search/person", "params": {"query": "Leonardo DiCaprio"}, "description": "Search for the person id of Leonardo DiCaprio."}
observing agent: >> The id of the person is 6193
grounding agent: >> Second, use <person_personID_movie_credits> to get the details of the movies Leonardo DiCaprio (id 6193) joined.
execution agent: >> {"url": "https://api.themoviedb.org/3/person/6193/movie_credits", "description": "The movie credit list of Leonardo DiCaprio."}
observing agent: >> the latest movie is Untitled Barry Jenkins 'Virunga' Adaptation (718375).
grounding agent: >> Third, use <movie_movieID_credits> to get the director of the movie 718375.
execution agent: >> {"url": "https://api.themoviedb.org/3/movie/718375/credits", "description": "The director of the movie Untitled Barry Jenkins 'Virunga' Adaptation."}
observing agent: >> the director of this movie (id 718375) is Barry Jenkins.
grounding agent: >> the director of Leonardo DiCaprio's latest movie is Barry Jenkins.

ReAct:
Turn 1-1: >> <search_person>
Turn 1-2: >> {"query": "Leonardo DiCaprio","page": 1, "include_adult": "false","region": "US"}
Turn 1-3: >> {"page":1,"results":{"adult":false,"gender":2,"id":6193,"known_for_department":"Acting"...}
Turn 2-1: >> person_personID_movie_credits
Turn 2-2: >> {"person_id": "6193"}
Turn 2-3: >> {"cast":{"adult":false,"backdrop_path":"/m0miVRUF6vmCos.jpg","id":454,"ori... }
Turn 3-1: >> movie_movieID_credits
Turn 3-2: >> {"movie_id": "791373"}
Turn 3-3: >> {"id":791373,"cast":{"adult":false,"gender":2,"id":880,"department":"Acting","name"...
Turn 4-1: >> movie_movieID_credits
Turn 4-2: >> {"movie_id": "791373"}
Turn 4-3: >> {"id":791373,"cast":{"adult":false,"gender":2,"id":880,"department":"Acting","name"...
Turn 5-1: >> movie_movieID_credits
Turn 5-2: >> {"movie_id": "791373"}
Turn 5-3: >> {"id":791373,"cast":{"adult":false,"gender":2,"id":880,"department":"Acting","name":"...
Turn 6-1: >> The movie with ID 791373 does not have the director information.
Table 5: An example of model outputs on the RestBench-TMDB dataset, where we highlight the critical task-solving steps, i.e., selecting a tool, executing the tool, and incorporating the execution results for the next step. We also highlight the right and wrong values extracted by agents from execution results in the overall workflow. This example shows that our method is more effective at solving a complex task. We represent the same execution result with the same placeholder (e.g., %s) for a more intuitive explanation.
[Figure: trajectory of our CoAgents for the task "Can you give me a photo belonging to the second episode of the first season of The Witcher?". The grounding agent first instructs "please use search_tv tool to search for the TV show id of The Witcher"; the execution agent requests https://api.themoviedb.org/3/search/tv with query "The Witcher"; the observing agent writes code such as idx = [e['id'] for e in data if e['name'] == 'The Witcher'] and reports "The id of the TV show 'The Witcher' is 71912". The grounding agent then instructs "use season_episode_images to get the images that belong to the second episode of the first season of The Witcher (TV id is 71912, season number is 1, episode number is 2)"; the execution agent requests https://api.themoviedb.org/3/tv/71912/season/1/episode/2/images; the observing agent prints the extracted image file paths (e.g., /7KWyZgiSXrkOtNk.jpg and /iU3UoS9EOGh8Zag.jpg), and the task finishes.]
Figure 6: An example to illustrate the workflow of our proposed cooperative framework ConAgents, which synergizes the grounding agent, execution agent, and observing agent for several turns until completing the task. For each turn, the grounding agent generates a tool-use instruction, while the execution agent and observing agent are employed to execute the selected tool and extract relevant values from lengthy execution results, respectively. CoAgents is short for ConAgents.
We conduct experiments on the full datasets of TMDB and Spotify. Due to the intensive inference cost of LLM-based agents, we randomly sample 100 cases as test sets from the complex tasks in the ToolBench dataset to evaluate the performance of our cooperative agent framework in solving practical tasks.

Extend existing datasets. Existing datasets only annotate the required tools to solve the tasks, lacking which arguments to pass, the execution results, and how to incorporate the results step-by-step. We first extend the RestBench-TMDB and RestBench-Spotify datasets with fine-granularity task-solving trajectories to support our human evaluation of the Correct Rate of Parsing. We select these two datasets for annotation because they are open-source, free of charge, and high-quality. Specifically, we invite four annotators to manually solve a task step-by-step using the ground-truth tools provided by the existing datasets. To guarantee annotation quality, we ask at least two annotators to solve the tasks in the dataset repeatedly. If there is a discrepancy between the two annotators (i.e., the two annotators give different solutions), we ask a third annotator to recheck it. Finally, we extend the full datasets of TMDB and Spotify, with each case containing the required tool names, the corresponding arguments, the execution results, and the final answer. We randomly sample 30 cases from the two extended datasets, respectively, which are used for our human evaluation. We also release the extended datasets to facilitate future research.
[Figure: the execution agent is instructed to "Use GET_movie_id_similar to get a list of similar movies to Titanic (movie id is 597)". Its first request leaves the path placeholder unresolved ("path": "movie/{movie_id}/similar") and the server returns {"success":false,"status_code":6,"status_message":"Invalid id: The pre-requisite id is invalid or not found."}. Following the error message, the agent calibrates the arguments to "path": "movie/597/similar" and successfully receives the data, e.g., {"page":1,"results":[...]}.]

Figure 7: An example to illustrate the workflow of our iterative calibration with tool servers, where the execution agent calibrates the misaligned arguments following the error messages from the tool servers and executes the tools successfully to request the data.

A.4 System prompt for Agents

We provide the system prompts to enable the agents in our ConAgents to perform specialized actions. The system prompt for the grounding agent is shown in Table 6. The system prompt for the execution agent is shown in Table 7. The system prompt for the observing agent is shown in Table 8.

A.5 Synthetic Schema with Recursive Decomposition

We propose a recursive decomposition algorithm to synthesize a schema automatically. Our algorithm constructs the schema by recursively decomposing each element in the lengthy execution results, representing the hierarchical structure among elements with the depth of recursion, and listing the type of the corresponding value as the schema. Pseudo code of the overall procedure is given in Algorithm 1. To explain more intuitively, we also illustrate our algorithm with an example in Figure 10, which demonstrates how to construct the schema (right) based on a lengthy execution result (left).

Algorithm 1: Construct the schema of the execution result with recursive decomposition.
Data: root, depth (default = 0)
Result: The synthetic schema, represented as a list of tuples, i.e., (name, type, depth)
// depth indicates the hierarchical structure for each element in a complex variable
// IsBasic(x) → True/False; Iterate(x) → each element of x
1  Function RecurDecompose(root, depth):
2      name ← root.name
3      type ← root.type
4      tree ← List[(name, type, depth)]
5      if IsBasic(root) then
6          return tree
7      foreach child in Iterate(root) do
8          tmp ← RecurDecompose(child, depth + 1)
9          tree.extend(tmp)
10     return tree  // get the tree

The synthetic schema, paired with the execution results, is then fed into the observing agent, prompting the observing agent to generate the code used to extract relevant values. This method can be used to deal with unrecorded tools in more general scenarios.
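For illustration, here is a minimal Python sketch of Algorithm 1 (our own rendering, not the released implementation). It assumes the execution result is a parsed JSON value, uses dict keys or a synthetic index such as items[0] as element names, and decomposes only the first element of each list as a representative, mirroring the schema shown in Figure 10:

```python
import json

def recur_decompose(root, name="response", depth=0):
    """Recursively decompose a parsed JSON value into a schema:
    a flat list of (name, type, depth) tuples."""
    tree = [(name, type(root).__name__, depth)]
    if isinstance(root, dict):
        for key, child in root.items():
            tree.extend(recur_decompose(child, key, depth + 1))
    elif isinstance(root, list) and root:
        # Decompose only the first element, named e.g. "items[0]".
        tree.extend(recur_decompose(root[0], f"{name}[0]", depth + 1))
    # Basic values (str, int, bool, None, ...) terminate the recursion.
    return tree

if __name__ == "__main__":
    result = json.loads(
        '{"album_type": "album", "artists": [{"name": "Pitbull", "type": "artist"}]}'
    )
    for name, typ, depth in recur_decompose(result):
        print("  " * depth + f"- {name}: {typ}")
```

Run on this toy response, the sketch prints an indented listing ("- response: dict", "- album_type: str", "- artists: list", "- artists[0]: dict", ...) in the same style as the right-hand side of Figure 10.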
A.6 Efficiency Analysis

We further explore the efficiency of our proposed ConAgents, where we compare the token consumption of ConAgents and the strong baselines ReAct@N on the RestBench-TMDB dataset with the same setting as in Table 1. We set N from 1 to 5 for a comprehensive comparison. We show the frequency histograms for the number of consumed tokens of the different methods in Figure 9. We find that our framework spends fewer tokens compared with the strong baselines ReAct@N (N ≥ 2), with almost the same token consumption as vanilla ReAct. The potential reason is that the modularized [...]
Prompt to enable the grounding agent
System prompt
You are an agent that accesses external Tools and utilizes them to solve user queries step by step.
Please interleave three steps: Thought, Tool Selection and Execution Result. Here is the format:
User Query: the query of users
Thought: "Specify which tool to use and what value you want to obtain through this tools."
Tool Selection: "Just select a tool from the list. DO NOT change the name of the tool."
Execution Result: "Specify what you want to obtain from the execution results."
...(The Thought, API Selection, and Execution Result can be repeated N times)
Final Answer: Your final answer to this task
Note: (1) Every step of Thought should remain atomic, e.g., keeping it as simple as possible and include only one single tool.
(2) DO NOT use pronouns in the plan but use the corresponding results obtained previously.
==API List==
{api_list}
Table 6: The prompt to enable the grounding agent in our cooperative framework. We detail the system prompt and
input prompt, respectively.
Input prompt
You can access external tools to execute the instruction:
{instruction}
Please complete the arguments and request data from servers based on the tool documentation:
{documentation}
When receiving error messages, you should revise your output and try again.
Table 7: The prompt to enable the execution agent in our cooperative framework. We detail the system prompt and
input prompt, respectively.
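To make the retry instruction in Table 7 concrete, the following is a minimal sketch of how an execution agent could revise its arguments upon server error messages. The helpers call_llm and MAX_CALIBRATION_TURNS are our own illustrative assumptions, not part of the paper's released code; we take the turn budget to correspond roughly to the hyper-parameter α:

```python
import json
import requests

MAX_CALIBRATION_TURNS = 3  # assumed to play the role of the iteration budget alpha

def call_llm(prompt: str) -> str:
    """Placeholder for querying the underlying LLM; assumed, not provided."""
    raise NotImplementedError

def execute_with_calibration(instruction: str, documentation: str) -> dict:
    """Ask the execution agent for a request, then feed server error
    messages back so it can revise the arguments and try again."""
    prompt = (f"You can access external tools to execute the instruction:\n{instruction}\n"
              f"Please complete the arguments based on the tool documentation:\n{documentation}")
    for _ in range(MAX_CALIBRATION_TURNS):
        request = json.loads(call_llm(prompt))  # e.g. {"url": ..., "params": {...}}
        response = requests.get(request["url"], params=request.get("params", {}))
        body = response.json()
        if response.ok and body.get("success", True):
            return body  # tool executed successfully
        # Append the error message (e.g. "Invalid id: ...") and ask for a revision.
        prompt += (f"\nError message: {body.get('status_message', response.text)}\n"
                   "Please revise your output and try again.")
    raise RuntimeError("calibration budget exhausted")
```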
Input prompt
Here is a python "dict" called "response". Please write code to extract values from the `response` following the instruction:
{instruction}
You should revise your code when you receive an exception error from the Python interpreter.
Your Output:
```python
[Python code]
```
Table 8: The prompt to enable the observing agent in our cooperative framework. We detail the system prompt and input prompt, respectively.
[Figure: the observing agent is instructed "Based on the person id (153), use person_movie_credits to get the movies directed by this person"; the execution agent requests https://api.themoviedb.org/3/person/153/movie_credits and receives a lengthy {"cast":[...]} response. The observing agent's first extraction code, which appends each movie['name'] and joins them with " ".join(names), raises Exception Type: <class 'TypeError'>, Exception Value: "sequence item 0: expected str instance, NoneType found". The iterative calibration with the code interpreter feeds this error back so the agent revises the code and extracts the values successfully.]

Figure 8: An example to illustrate the workflow of our iterative calibration with code interpreter, where the observing agent revises the programming error following the exception error from the interpreter and generates correct code to extract the relevant values from lengthy content.
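Correspondingly, a minimal sketch of the observing agent's calibration loop with the code interpreter might look as follows. Again, call_llm is a hypothetical helper, and we assume as a convention that the generated code stores its output in a variable named result; the released implementation may differ:

```python
import traceback

def call_llm(prompt: str) -> str:
    """Placeholder for querying the underlying LLM; assumed, not provided."""
    raise NotImplementedError

def observe_with_calibration(instruction: str, response: dict, max_turns: int = 3):
    """Prompt the observing agent to write extraction code over `response`,
    re-prompting with the interpreter's error until it succeeds.
    max_turns is assumed to play the role of the iteration budget beta."""
    prompt = ('Here is a python dict called "response". Please write code to '
              f"extract values from `response` following the instruction:\n{instruction}")
    for _ in range(max_turns):
        code = call_llm(prompt)
        scope = {"response": response, "result": None}
        try:
            exec(code, scope)           # run the generated extraction code
            return scope.get("result")  # the extracted relevant values
        except Exception:
            # Feed the exception (e.g. the TypeError in Figure 8) back to the agent.
            prompt += (f"\nYour code raised:\n{traceback.format_exc()}\n"
                       "Please revise your code.")
    raise RuntimeError("calibration budget exhausted")
```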
[Figure: six frequency histograms of per-case token consumption, one for ours and one for each of ReAct@1 through ReAct@5, each annotated with its average consumption µ.]

Figure 9: The efficiency analysis for our method and the ReAct@N baselines (1 ≤ N ≤ 5), where we count the distribution of tokens consumed by each case and compute the average consumption µ.
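For reference, per-case token statistics of the kind plotted in Figures 5 and 9 can be gathered with an off-the-shelf tokenizer such as tiktoken. This small sketch is our own; the paper does not specify which tokenizer was used, and the per-case message logs are assumed inputs:

```python
import statistics
import tiktoken  # open-source BPE tokenizer

enc = tiktoken.get_encoding("cl100k_base")

def tokens_per_case(case_messages: list[str]) -> int:
    """Total tokens consumed by one test case (all prompts and responses)."""
    return sum(len(enc.encode(m)) for m in case_messages)

def consumption_stats(all_cases: list[list[str]]) -> tuple[list[int], float]:
    """Per-case token counts (histogram data) and their mean µ."""
    counts = [tokens_per_case(case) for case in all_cases]
    return counts, statistics.mean(counts)
```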
[Figure: on the left, a lengthy execution result from a Spotify album tool, with fields such as album_type, artists, copyrights, id, images, is_playable, label, name, popularity, release_date, release_date_precision, total_tracks, and a nested tracks object containing href, items (with artists, disc_number, duration_ms, explicit, external_urls, href, id, restrictions), limit, next, offset, previous, and total; on the right, the synthetic schema derived from it, e.g., "- response: dict", "- album_type: str", "- artists: list", "- artists[0]: dict", "- name: str", ..., "- uri: str", with indentation encoding the depth of each element.]

Figure 10: A concrete example to demonstrate how to construct a schema for the tool response, where we recursively decompose each element in the execution result, represent the hierarchical structure via indentation, and list the type of the corresponding values.
B Reviewer PHS5 (S: 3, O: 3)

Weakness 1: The approach of dividing an agent into a planning agent, a tool-invocation agent, plus additional agents is not novel, there are several similar works:
[1] Song Y, Xiong W, Zhu D, et al. Restgpt: Connecting large language models with real-world applications via restful apis[J]. arXiv preprint arXiv:2306.06624, 2023.
[2] Shen W, Li C, Chen H, et al. Small llms are weak tool learners: A multi-llm agent[J]. arXiv preprint arXiv:2401.07324, 2024.
[3] Qian C, Dang Y, Li J, et al. Experiential co-learning of software-developing agents[J]. arXiv preprint arXiv:2312.17025, 2023.

Response:

Weakness 2: The issue of "the limitation of the inherent capability of a single LLM to perform diverse actions" proposed in this paper, has not been well analyzed. The paper merely demonstrates that their proposed framework outperforms other frameworks. However, since the underlying LLM for all agents within this framework remains the same single LLM, the experimental results do not actually illustrate the limitations of a single LLM's capabilities. A possible method to verify this may be using a single-agent framework, where the system prompt of this agent simultaneously describes the functions of the grounding agent, execution agent, and observing agent, and then sequentially execute this single agent as the process introduced in this paper.

Response:

C Reviewer djSr (S: 3.5, O: 3)

Weakness 1: authors don't discuss much of the characteristics of the datasets and their complexity

Response:

Weakness 2: human evaluation is small in scale (3 annotators only)

Response:

D Reviewer YkaU (S: 2.5, O: 2.5)

Weakness 1: The core concept of modularizing the tool learning process into specialized agents shares similarities with prior works such as RestGPT. The overall architecture design is also quite similar to RestGPT. The introduction of the Observing agent for processing lengthy tool responses and the IterCali for error correction does present some advancement; however, the overarching architecture's resemblance to existing models raises questions about the framework's novelty.

Response:

Weakness 2: The IterCali, although crucial for error correction and adaptability, bears similarity to ReAct's feedback-based action loop. The concept itself is not that new. Additionally, the revise-based-on-feedback mechanism has been discussed in many previous works (e.g., Reflexion and Generative Agents). I tend to not call it novel after around 1 year after the release of Reflexion and Generative Agents. It is more like a common trick.

Response:

Weakness 3: The choice of baselines in the analysis experiments only includes ReAct, which, while relevant, restricts the comparative analysis's breadth. The absence of a direct comparison with RestGPT and ToolLLM, especially considering the conceptual similarities with RestGPT, might lead to an incomplete evaluation of the proposed framework's performance and efficiency.

Response:

Weakness 4: Although stating that the observer is "schema-free", it heavily relies on the assumption that the tool responses are in JSON or other similar structured formats. Compared to not using an observer, the generalizability is decreased while the performance is improved.

Response:

Overall Weakness: I acknowledge that this paper is "solid" in the sense of the comprehensive experimental work presented. However, the architectural resemblance between ConAgents and RestGPT raises concerns about the novelty and significance of the proposed modifications. The incremental modification of the architecture seems trivial to me. I understand that the research in agent-based systems is fast-paced, but introducing some incremental changes to an architecture that was established more than half a year ago and not showing really significant improvement is not so appealing to me. In light of these considerations, I'm inclined to a weak rejection.

Response:

Comments: In Table 1, ToolLLM is the best at the Path% of ToolBench, but it is not bolded. And in the Table 1 ablation study, it should be M_O instead of M_E?

Response: