Abstract: Large language models have attracted widespread attention for their strong generative and comprehension capabilities, yet they remain limited in accessing real-time information and performing complex calculations. To better serve user needs, equipping large language models with tool-use capabilities has become a major focus of current research. This survey first clarifies the fundamental concepts of tool use in large language models and traces its development in chronological order. It then summarizes the datasets and technical approaches related to tool use and analyzes their applications in fields such as agents and embodied intelligence. Finally, it outlines future research priorities and development directions for tool use in large language models.
Key words:
- Large language models
- tool use
- tool-augmented generation
- agent
- embodied intelligence
Table 1 The development of tool use in large language models
| Release date | Name | Tool type | Dialogue turns | Tool count and relations | Acquisition of tool-use ability |
| --- | --- | --- | --- | --- | --- |
| 2022-05 | TALM[5] | API-based tools | Single query | Multiple tools (with complex relations) | Supervised fine-tuning |
| 2022-11 | PAL[4] | Python interpreter | Single query | Multiple tools | In-context learning |
| 2023-02 | Toolformer[6] | API-based tools | Single query | Single tool | Supervised fine-tuning |
| 2023-03 | GPT4-Plugin[1] | API-based tools | Multi-turn dialogue | Multiple tools | Supervised fine-tuning + reinforcement learning |
| 2023-03 | HuggingGPT[11] | Neural network modules | Single query | Multiple tools (with complex relations) | In-context learning |
| 2023-03 | ViperGPT[23] | Python functions | Single query | Multiple tools (with complex relations) | In-context learning |
| 2023-04 | MOSS[7] | API-based tools | Multi-turn dialogue | Multiple tools | Supervised fine-tuning |
| 2023-04 | API-Bank[19] | API-based tools | Multi-turn dialogue | Multiple tools | Supervised fine-tuning |
| 2023-05 | APIBench[31] | Python functions | Single query | Single tool | Supervised fine-tuning |
| 2023-05 | GPT4Tools[15] | Neural network modules | Multi-turn dialogue | Multiple tools | In-context learning |
| 2023-05 | ToolkenGPT[41] | API-based tools | Single query | Multiple tools (with complex relations) | Supervised fine-tuning |
| 2023-05 | TRICE[18] | API-based tools | Single query | Multiple tools (with complex relations) | Supervised fine-tuning + reinforcement learning |
| 2023-05 | CRITIC[12] | Python functions | Single query | Multiple tools | In-context learning |
| 2023-05 | LATM[24] | Python functions | Single query | Single tool | In-context learning + tool creation |
| 2023-05 | CREATOR[25] | Python functions | Single query | Multiple tools | In-context learning + tool creation |
| 2023-05 | ToolBench[17] | API-based tools | Single query | Single tool | In-context learning |
| 2023-06 | ToolAlpaca[20] | API-based tools | Multi-turn dialogue | Multiple tools | Supervised fine-tuning |
| 2023-07 | ToolLLM[14] | API-based tools | Single query | Multiple tools | Supervised fine-tuning |
| 2023-08 | Confucius[35] | API-based tools | Single query | Multiple tools | Multi-stage supervised fine-tuning |
| 2023-09 | ToRA[26] | Python interpreter | Single query | Multiple tools (with complex relations) | Supervised fine-tuning |
| 2023-09 | CRAFT[32] | Python functions | Single query | Multiple tools (with complex relations) | In-context learning |
| 2023-10 | MetaTool[10] | API-based tools | Single query | Multiple tools | In-context learning |
| 2023-10 | ToolChain[38] | API-based tools | Single query | Multiple tools | In-context learning + decision-process optimization |
| 2023-11 | ToolTalk[48] | Python functions | Multi-turn dialogue | Multiple tools (with complex relations) | In-context learning |
| 2023-12 | CLOVA[33] | Python functions | Single query | Multiple tools (with complex relations) | In-context learning |
| 2023-12 | T-Eval[13] | API-based tools | Multi-turn dialogue | Multiple tools (with complex relations) | In-context learning |
| 2024-01 | ToolEyes[49] | API-based tools | Single query | Multiple tools | Supervised fine-tuning |
| 2024-01 | MLLM-Tool[50] | Neural network modules | Single query | Multiple tools (with complex relations) | Supervised fine-tuning |
| 2024-01 | TroVE[34] | Python functions | Single query | Multiple tools | In-context learning + tool creation |
| 2024-01 | EasyTools[43] | API-based tools | Single query | Multiple tools | In-context learning + tool-documentation compression |
| 2024-02 | AnyTool[39] | API-based tools | Single query | Multiple tools | In-context learning + retrieval-process optimization |
| 2024-02 | SciToolBench[51] | Python functions | Single query | Multiple tools | Supervised fine-tuning |
| 2024-03 | ToolRerank[40] | API-based tools | Single query | Multiple tools | In-context learning + retrieval-process optimization |
| 2024-03 | STE[16] | API-based tools | Single query | Single tool | Supervised fine-tuning + error-feedback handling |
| 2024-05 | Seal-Tools[52] | API-based tools | Single query | Multiple tools (with complex relations) | Supervised fine-tuning |
| 2024-06 | ToolPreference[53] | API-based tools | Single query | Multiple tools | Supervised fine-tuning + preference optimization |
| 2024-06 | UltraTool[54] | API-based tools | Multi-turn dialogue | Multiple tools (with complex relations) | In-context learning |
| 2024-07 | GTA[55] | API-based tools | Single query | Multiple tools (with complex relations) | In-context learning |
| 2024-07 | Llama-3.1[8] | API-based tools | Multi-turn dialogue | Multiple tools | Supervised fine-tuning + reinforcement learning |
| 2024-07 | AppWorld[27] | Mobile apps | Single query | Multiple tools (with complex relations) | In-context learning |
| 2024-07 | ShortcutsBench[28] | Mobile apps | Single query | Multiple tools | In-context learning |
| 2024-08 | ToolSandbox[29] | Mobile apps | Multi-turn dialogue | Multiple tools (with complex relations) | Supervised fine-tuning |
| 2024-09 | ToolACE[2] | API-based tools | Multi-turn dialogue | Multiple tools (with complex relations) | Supervised fine-tuning |
| 2024-10 | StepTool[44] | API-based tools | Single query | Multiple tools | Reinforcement learning |
| 2024-10 | MTU-Bench[37] | API-based tools | Multi-turn dialogue | Multiple tools (with complex relations) | Supervised fine-tuning |
| 2024-10 | ToolGen[42] | API-based tools | Single query | Multiple tools | Supervised fine-tuning + tool-documentation compression |
| 2024-10 | AndroidWorld[30] | Mobile apps | Single query | Multiple tools (with complex relations) | In-context learning |
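Most entries in Table 1 follow the same basic pattern: tool documentation is placed in the context, the model emits a structured call, and the runtime executes it and feeds the result back. The Python sketch below illustrates this generic single-query, API-style loop; the tool registry, prompt format, and stubbed model completion are all hypothetical, not the implementation of any system listed above.

```python
import re

# Hypothetical tool registry: each tool pairs an executable function with the
# short documentation shown to the model in the prompt (in-context learning).
TOOLS = {
    "calculator": {
        "doc": 'calculator(expression="...")  # evaluates an arithmetic expression',
        "func": lambda expression: str(eval(expression, {"__builtins__": {}})),
    },
}

def build_prompt(query: str) -> str:
    """Single-query prompt: tool documentation followed by the user question."""
    docs = "\n".join(tool["doc"] for tool in TOOLS.values())
    return f"Available tools:\n{docs}\n\nQuestion: {query}\nCall a tool if needed."

def execute_tool_call(completion: str) -> str:
    """Parse a call such as calculator(expression="23*7") and run it."""
    match = re.search(r'(\w+)\(expression="([^"]+)"\)', completion)
    if match is None:
        return completion  # the model answered directly; no tool was needed
    name, argument = match.groups()
    return TOOLS[name]["func"](argument)

# A stubbed completion standing in for a real LLM response to build_prompt(...).
completion = 'calculator(expression="23*7")'
print(execute_tool_call(completion))  # -> 161
```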
Table 2 Overview of tool-use datasets
| Dataset | Tools | Instances | Single-turn | Multi-turn | Single tool | Multiple tools | Independent relations | Dependency relations | Nested relations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Toolformer[6] | 5 | 12,500 | √ | × | √ | × | × | × | × |
| API-Bank[19] | 2,211 | 2,202 | √ | √ | √ | √ | √ | × | × |
| APIBench[31] | 11,645 | 16,450 | √ | × | √ | × | × | × | × |
| ToolBench[17] | 232 | 2,764 | √ | × | √ | × | × | × | × |
| ToolAlpaca[20] | 426 | 3,938 | √ | √ | √ | √ | √ | × | × |
| RestBench[56] | 94 | 157 | √ | × | √ | × | × | × | × |
| ToolQA[64] | 13 | 530 | √ | × | √ | √ | √ | × | × |
| ToolLLM[14] | 16,464 | 126,486 | √ | × | √ | √ | √ | × | × |
| MetaTool[10] | 199 | 21,127 | √ | × | √ | √ | √ | √ | × |
| TaskBench[57] | 103 | 28,127 | √ | × | √ | √ | √ | √ | × |
| ToolTalk[48] | 28 | 78 | √ | √ | √ | √ | √ | √ | × |
| T-Eval[13] | 15 | 533 | × | √ | × | √ | √ | √ | × |
| ToolEyes[49] | 568 | 382 | √ | × | √ | √ | √ | × | × |
| UltraTool[54] | 2,032 | 5,824 | √ | × | √ | √ | √ | √ | √ |
| MLLM-Tool[50] | 932 | 11,642 | √ | × | √ | √ | √ | × | × |
| SciToolBench[51] | 2,446 | 856 | √ | × | √ | √ | × | √ | × |
| Seal-Tools[52] | 4,076 | 14,076 | √ | × | √ | √ | √ | √ | √ |
| ShortcutsBench[28] | 1,414 | 7,627 | √ | × | √ | √ | √ | × | × |
| GTA[55] | 14 | 229 | √ | × | √ | √ | √ | √ | × |
| AppWorld[27] | 457 | 750 | √ | × | √ | √ | √ | √ | √ |
| ToolSandbox[29] | 34 | 1,032 | √ | √ | √ | √ | √ | √ | × |
| CToolEval[65] | 398 | 6,816 | √ | × | √ | √ | √ | √ | × |
| ToolACE[2] | 26,507 | 11,300 | √ | √ | √ | √ | √ | √ | √ |
| MTU-Bench[37] | 136 | 159,061 | √ | √ | √ | √ | √ | √ | √ |
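The three relation columns in Table 2 (independent, dependency, nested) describe how multiple tool calls within a single instance interact. The sketch below illustrates the distinction with invented tool names and a simplified record format; the actual datasets each use their own schemas.

```python
# Independent: the calls share no data and could run in any order.
independent = {
    "query": "What is the weather in Beijing and in Shanghai?",
    "calls": [
        {"tool": "get_weather", "args": {"city": "Beijing"}},
        {"tool": "get_weather", "args": {"city": "Shanghai"}},
    ],
}

# Dependency: the second call consumes the first call's output ($w1).
dependent = {
    "query": "Email today's Beijing weather to Alice.",
    "calls": [
        {"tool": "get_weather", "args": {"city": "Beijing"}, "output": "$w1"},
        {"tool": "send_email", "args": {"to": "Alice", "body": "$w1"}},
    ],
}

# Nested: one call appears directly as an argument of another.
nested = {
    "query": "Convert the current gold price to euros.",
    "calls": [
        {"tool": "convert_currency",
         "args": {"amount": {"tool": "get_gold_price", "args": {}},
                  "to": "EUR"}},
    ],
}

for label, instance in [("independent", independent),
                        ("dependent", dependent), ("nested", nested)]:
    print(label, "->", len(instance["calls"]), "top-level call(s)")
```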
[1] OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L and Akkaya I et al. GPT-4 Technical Report. arXiv preprint arXiv: 2303.08774, 2024.
[2] Liu W W, Huang X, Zeng X S, Hao X L, Yu S and Li D X et al. ToolACE: Winning the Points of LLM Function Calling. arXiv preprint arXiv: 2409.00920, 2024.
[3] Abdelaziz I, Basu K, Agarwal M, Kumaravel S, Stallone M and Panda R et al. Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks. arXiv preprint arXiv: 2407.00121, 2024.
[4] Gao L Y, Madaan A, Zhou S Y, Alon U, Liu P F and Yang Y M et al. PAL: Program-aided language models. In: Proceedings of the 40th International Conference on Machine Learning. Honolulu, Hawaii, USA: PMLR, 2023. 10764–10799
[5] Parisi A, Zhao Y and Fiedel N. TALM: Tool Augmented Language Models. arXiv preprint arXiv: 2205.12255, 2022.
[6] Schick T, Yu J D, Dessi R, Raileanu R, Lomeli M and Hambro E et al. Toolformer: Language models can teach themselves to use tools. In: Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). New Orleans, LA, USA: NeurIPS Foundation, 2023.
[7] Sun T X, Zhang X T, He Z F, Li P, Cheng Q Y and Liu X Y et al. MOSS: An Open Conversational Large Language Model. Machine Intelligence Research, 2024, 21(5): 888−905. doi: 10.1007/s11633-024-1502-8
[8] Dubey A, Jauhri A, Pandey A, Kadian A, AlDahle A and Letman A et al. The Llama 3 Herd of Models. arXiv preprint arXiv: 2407.21783, 2024.
[9] Qwen. QwQ-32B: Embracing the Power of Reinforcement Learning. GitHub homepage, 2025. https://qwenlm.github.io/blog/qwq-32b/
[10] Huang Y, Shi J W, Li Y, Fan C R, Wu S Y and Zhang Q H et al. MetaTool benchmark for large language models: Deciding whether to use tools and which to use. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2024.
[11] Shen Y L, Song K T, Tan X, Li D S, Lu W M and Zhuang Y T. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. In: Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). New Orleans, LA, USA: NeurIPS Foundation, 2023.
[12] Gou Z B, Shao Z H, Gong Y Y, Shen Y L, Yang Y J and Duan N et al. CRITIC: Large language models can self-correct with tool-interactive critiquing. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2024.
[13] Chen Z H, Du W H, Zhang W W, Liu K K, Liu J N and Zheng M et al. T-Eval: Evaluating the tool utilization capability of large language models step by step. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand: Association for Computational Linguistics, 2024. 9510–9529
[14] Qin Y J, Liang S H, Ye Y N, Zhu K L, Yan L and Lu Y X et al. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint arXiv: 2307.16789, 2023.
[15] Yang R, Song L, Li Y W, Zhao S J, Ge Y X and Li X et al. GPT4Tools: Teaching large language model to use tools via self-instruction. In: Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). New Orleans, LA, USA: NeurIPS Foundation, 2023.
[16] Wang B S, Fang H, Eisner J, Durme B V and Su Y. LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error. arXiv preprint arXiv: 2403.04746, 2024.
[17] Xu Q T, Hong F L, Li B, Hu C R, Chen Z Y and Zhang J. On the Tool Manipulation Capability of Open-source Large Language Models. arXiv preprint arXiv: 2305.16504, 2023.
[18] Qiao S F, Gui H H, Lv C F, Jia Q H, Chen H J and Zhang N Y. Making Language Models Better Tool Learners with Execution Feedback. arXiv preprint arXiv: 2305.13068, 2024.
[19] Li M H, Zhao Y X, Yu B W, Song F F, Li H Y and Yu H Y et al. API-Bank: A comprehensive benchmark for tool-augmented LLMs. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, 2023. 3102–3116
[20] Tang Q Y, Deng Z L, Lin H Y, Han X P, Liang Q and Cao B X et al. ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases. arXiv preprint arXiv: 2306.05301, 2023.
[21] Raffel C, Shazeer N, Roberts A, Lee K, Narang S and Matena M et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020, 21(140): 1−67
[22] Chen M, Tworek J, Jun H W, Yuan Q M, Pinto H P d O and Kaplan J et al. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv: 2107.03374, 2021.
[23] Surís D, Menon S and Vondrick C. ViperGPT: Visual inference via Python execution for reasoning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: IEEE, 2023. 11888–11898
[24] Cai T L, Wang X Z, Ma T Y, Chen X Y and Zhou D. Large language models as tool makers. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2024.
[25] Qian C, Han C, Fung Y, Qin Y, Liu Z and Ji H. CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models. In: Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics, 2023. 6922–6939
[26] Gou Z B, Shao Z H, Gong Y Y, Shen Y L, Yang Y J and Huang M L et al. ToRA: A tool-integrated reasoning agent for mathematical problem solving. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2024.
[27] Trivedi H, Khot T, Hartmann M, Manku R, Dong V and Li E et al. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024. 16022–16076
[28] Shen H Y, Li Y, Meng D S, Cai D Q, Qi S and Zhang L et al. ShortcutsBench: A large-scale real-world benchmark for API-based agents. In: Proceedings of the Thirteenth International Conference on Learning Representations. Singapore: OpenReview.net, 2025.
[29] Lu J R, Holleis T, Zhang Y Z, Aumayer B, Nan F and Bai F et al. ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities. arXiv preprint arXiv: 2408.04682, 2024.
[30] Rawles C, Clinckemaillie S, Chang Y F, Waltz J, Lau G and Fair M et al. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. arXiv preprint arXiv: 2405.14573, 2024.
[31] Patil S G, Zhang T J, Wang X and Gonzalez J E. Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint arXiv: 2305.15334, 2023.
[32] Yuan L F, Chen Y Y, Wang X Y, Fung Y, Peng H and Ji H. CRAFT: Customizing LLMs by creating and retrieving from specialized toolsets. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2024.
[33] Gao Z, Du Y, Zhang X, Ma X, Han W and Zhu S C et al. CLOVA: A closed-loop visual assistant with tool usage and update. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024. 13258–13268
[34] Wang Z R, Neubig G and Fried D. TroVE: Inducing verifiable and efficient toolboxes for solving programmatic tasks. In: Proceedings of the Forty-first International Conference on Machine Learning. Vienna, Austria: PMLR, 2024. 51177–51191
[35] Gao S, Shi Z L, Zhu M H, Fang B W, Xin X and Ren P J et al. Confucius: Iterative tool learning from introspection feedback by easy-to-difficult curriculum. In: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24). AAAI Press, 2024. 18030–18038
[36] RapidAPI. RapidAPI: A Platform for Discovering and Connecting to APIs, 2024.
[37] Wang P, Wu Y N, Wang Z K, Liu J H, Song X S and Peng Z Y et al. MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models. arXiv preprint arXiv: 2410.11710, 2024.
[38] Zhuang Y C, Chen X, Yu T, Mitra S, Bursztyn V and Rossi R A et al. ToolChain*: Efficient action space navigation in large language models with A* search. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2024.
[39] Du Y, Wei F Y and Zhang H Y. AnyTool: Self-reflective, hierarchical agents for large-scale API calls. In: Proceedings of the Forty-first International Conference on Machine Learning. Vienna, Austria: PMLR, 2024. 33001–33015
[40] Zheng Y, Li P, Liu W, Liu Y, Luan J and Wang B. ToolRerank: Adaptive and hierarchy-aware reranking for tool retrieval. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, Italia: ELRA and ICCL, 2024. 16263–16273
[41] Hao S B, Liu T Y, Wang Z and Hu Z T. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. In: Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). New Orleans, LA, USA: NeurIPS Foundation, 2023.
[42] Wang R X, Han X D, Ji L, Wang S, Baldwin T and Li H N. ToolGen: Unified Tool Retrieval and Calling via Generation. arXiv preprint arXiv: 2410.03439, 2024.
[43] Yuan S, Song K, Chen J, Tan X, Shen Y and Kan R et al. EasyTool: Enhancing LLM-based agents with concise tool instruction. In: Proceedings of the LLM Agents Workshop at the International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2024.
[44] Yu Y Q, Wang Z F, Ma W Z, Guo Z C, Zhan J T and Wang S et al. StepTool: A Step-grained Reinforcement Learning Framework for Tool Learning in LLMs. arXiv preprint arXiv: 2410.07745, 2024.
[45] Gemini Team, Georgiev P, Lei V I, Burnell R, Bai L B and Gulati A et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv: 2403.05530, 2024.
[46] Yang A, Yang B S, Hui B Y, Zheng B, Yu B W and Zhou C et al. Qwen2 Technical Report. arXiv preprint arXiv: 2407.10671, 2024.
[47] DeepSeek-AI, Zhu Q H, Guo D Y, Shao Z H, Yang D J and Wang P Y et al. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence. arXiv preprint arXiv: 2406.11931, 2024.
[48] Farn N and Shin R. ToolTalk: Evaluating Tool-Usage in a Conversational Setting. arXiv preprint arXiv: 2311.10775, 2023.
[49] Ye J J, Li G Y, Gao S Y, Huang C S, Wu Y L and Li S X et al. ToolEyes: Fine-grained evaluation for tool learning capabilities of large language models in real-world scenarios. In: Proceedings of the 31st International Conference on Computational Linguistics. Abu Dhabi, UAE: Association for Computational Linguistics, 2025. 156–187
[50] Wang C Y, Luo W X, Chen Q Y, Mai H N, Guo J D and Dong S X et al. MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning. arXiv preprint arXiv: 2401.10727, 2024.
[51] Ma Y, Gou Z, Hao J, Xu R, Wang S and Pan L et al. SciAgent: Tool-augmented Language Models for Scientific Reasoning. arXiv preprint arXiv: 2402.11451, 2024.
[52] Wu M S, Zhu T, Han H, Tan C Y, Zhang X and Chen W L. Seal-Tools: Self-instruct tool learning dataset for agent tuning and detailed benchmark. In: Natural Language Processing and Chinese Computing: NLPCC 2024. Springer, 2024. 372–384
[53] Chen S J, Wang Y B, Wu Y F, Chen Q G, Xu Z and Luo W H et al. Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees. In: Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada: NeurIPS Foundation, 2024.
[54] Huang S, Zhong W, Lu J, Zhu Q, Gao J and Liu W et al. Planning, creation, usage: Benchmarking LLMs for comprehensive tool utilization in real-world complex scenarios. In: Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics, 2024. 4363–4400
[55] Wang J Z, Ma Z R, Li Y N, Zhang S Y, Chen C L and Chen K et al. GTA: A benchmark for general tool agents. In: Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada: NeurIPS Foundation, 2024.
[56] Song Y F, Xiong W M, Zhu D W, Wu W H, Qian H and Song M B et al. RestGPT: Connecting Large Language Models with Real-World RESTful APIs. arXiv preprint arXiv: 2306.06624, 2023.
[57] Shen Y L, Song K T, Tan X, Zhang W Q, Ren K and Yuan S Y et al. TaskBench: Benchmarking large language models for task automation. In: Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada: NeurIPS Foundation, 2024.
[58] Basu K, Abdelaziz I, Chaudhury S, Dan S, Crouse M and Munawar A et al. API-BLEND: A comprehensive corpora for training and benchmarking API LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024. 12859–12870
[59] Wang H, Wang R, Xue B, Xia H, Cao J and Liu Z et al. AppBench: Planning of multiple APIs from various APPs for complex user instruction. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics, 2024. 15322–15336
[60] Wang W X, Shi J L, Wang C Z, Lee C, Yuan Y L and Huang J T et al. Learning to Ask: When LLMs Meet Unclear Instruction. arXiv preprint arXiv: 2409.00557, 2024.
[61] Ye J, Li S, Li G, Huang C, Gao S and Wu Y et al. ToolSword: Unveiling safety issues of large language models in tool learning across three stages. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024. 2181–2211
[62] Ye J, Wu Y, Gao S, Huang C, Li S and Li G et al. RoTBench: A multi-level benchmark for evaluating the robustness of large language models in tool learning. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics, 2024. 313–333
[63] Guo Z C, Cheng S J, Wang H, Liang S H, Qin Y J and Li P et al. StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models. arXiv preprint arXiv: 2403.07714, 2024.
[64] Zhuang Y C, Yu Y, Wang K, Sun H T and Zhang C. ToolQA: A dataset for LLM question answering with external tools. In: Advances in Neural Information Processing Systems. New Orleans, LA, USA: Curran Associates, Inc., 2023.
[65] Guo Z, Huang Y and Xiong D. CToolEval: A Chinese benchmark for LLM-powered agent evaluation in real-world API interactions. In: Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics, 2024. 15711–15724
[66] Papineni K, Roukos S, Ward T and Zhu W J. BLEU: a Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, 2002. 311–318
[67] Lin C Y. ROUGE: A Package for Automatic Evaluation of Summaries. In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, 2004. 74–81
[68] Bergroth L, Hakonen H and Raita T. A survey of longest common subsequence algorithms. In: Proceedings of the Seventh International Symposium on String Processing and Information Retrieval. 2000. 39–48
[69] Liu Y M, Peng X Y, Zhang Y W, Cao J N, Zhang X H and Cheng S et al. Tool-Planner: Dynamic Solution Tree Planning for Large Language Model with Tool Clustering. arXiv preprint arXiv: 2406.03807, 2024.
[70] Qiao S F, Fang R N, Qiu Z S, Wang X B, Zhang N Y and Jiang Y et al. Benchmarking Agentic Workflow Generation. arXiv preprint arXiv: 2410.07869, 2024.
[71] OpenMOSS. UnifiedToolHub. GitHub repository, 2025. https://github.com/OpenMOSS/UnifiedToolHub
[72] Zhou S Y, Xu F F, Zhu H, Zhou X H, Lo R and Sridhar A et al. WebArena: A realistic web environment for building autonomous agents. In: Proceedings of the Twelfth International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2024.
[73] Kim G W, Baldi P and McAleer S. Language models can solve computer tasks. In: Proceedings of the 37th Conference on Neural Information Processing Systems. New Orleans, LA, USA: Curran Associates, Inc., 2023.
[74] Liu Y L, Yuan Y L, Wang C W, Han J H, Ma Y Q and Zhang L et al. From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs. arXiv preprint arXiv: 2402.18157, 2024.
[75] Liu X, Qin B, Liang D Z, Dong G, Lai H Y and Zhang H C et al. AutoGLM: Autonomous Foundation Agents for GUIs. arXiv preprint arXiv: 2411.00820, 2024.
[76] Qi Z H, Liu X, Iong I L, Lai H Y, Sun X Q and Zhao W Y et al. WebRL: Training LLM web agents via self-evolving online curriculum reinforcement learning. In: Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025). Singapore: OpenReview.net, 2025.
[77] Wu Q, Liu W, Luan J and Wang B. ToolPlanner: A tool-augmented LLM for multi-granularity instructions with path planning and feedback. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics, 2024. 18315–18339
[78] Chen K, Cusumano-Towner M, Huval B, Petrenko A, Hamburger J and Koltun V et al. Reinforcement learning for long-horizon interactive LLM agents. arXiv preprint arXiv: 2502.01600, 2025.
[79] Kong Y, Ruan J, Chen Y, Zhang B, Bao T and Shiwei S et al. TPTU-v2: Boosting task planning and tool usage of large language model-based agents in real-world industry systems. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. Miami, Florida, USA: Association for Computational Linguistics, 2024. 371–385
[80] Liu X K, Peng Z Y, Yi X Y, Xie X, Xiang L R and Liu Y C et al. ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph. arXiv preprint arXiv: 2403.00839, 2024.
[81] Huang W, Abbeel P, Pathak D and Mordatch I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In: Proceedings of the 39th International Conference on Machine Learning. PMLR, 2022. 9118–9147
[82] Xu H S, Zhu S, Wang Z H, Zheng H, Ma D and Cao R S et al. Reducing Tool Hallucination via Reliability Alignment. arXiv preprint arXiv: 2412.04141, 2024.
[83] Xu G W, Jin P, Li H, Song Y B, Sun L C and Yuan L. LLaVA-CoT: Let vision language models reason step-by-step. arXiv preprint arXiv: 2411.10440, 2024.
[84] Koh J Y, McAleer S, Fried D and Salakhutdinov R. Tree search for language model agents. arXiv preprint arXiv: 2407.01476, 2024.
[85] Chen P, Bu P, Song J, Gao Y and Zheng B. Can VLMs play action role-playing games? Take Black Myth Wukong as a study case. In: Proceedings of the NeurIPS 2024 Workshop on Open-World Agents. Vancouver, Canada, 2024.
[86] Nakano R, Hilton J, Balaji S, Wu J, Ouyang L and Kim C et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv: 2112.09332, 2022.
[87] Yao S, Chen H, Yang J and Narasimhan K. WebShop: Towards scalable real-world web interaction with grounded language agents. In: Advances in Neural Information Processing Systems. 2022. 28744–28757
[88] Qiao S F, Fang R N, Zhang N Y, Zhu Y Q, Chen X and Deng S M et al. Agent planning with world knowledge model. In: Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates, Inc., 2024.
[89] Cao H, Zhang Y, Feng S, Yang X, Wang D and Zhang Y. TOOL-ED: Enhancing empathetic response generation with the tool calling capability of LLM. In: Proceedings of the 31st International Conference on Computational Linguistics. Abu Dhabi, UAE: Association for Computational Linguistics, 2025. 5305–5320
[90] Liao Z Y, Mo L B, Xu C J, Kang M T, Zhang J W and Xiao C W et al. EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage. arXiv preprint arXiv: 2409.11295, 2025.
[91] Chen Z R, Xiang Z, Xiao C W, Song D and Li B. AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases. arXiv preprint arXiv: 2407.12784, 2024.
[92] Xiang Z, Zheng L Z, Li Y J, Hong J Y, Li Q B and Xie H et al. GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning. arXiv preprint arXiv: 2406.09187, 2024.
[93] OpenAI, Andrychowicz M, Baker B, Chociej M, Józefowicz R and McGrew B et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 2020, 39(1): 3−20. doi: 10.1177/0278364919887447
[94] Kavraki L, Svestka P, Latombe J C and Overmars M. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 1996, 12(4): 566−580. doi: 10.1109/70.508439
[95] Shen Z Y, Wilson J P, Harvey R and Gupta S. MRRT: Multiple Rapidly-Exploring Random Trees for Fast Online Replanning in Dynamic Environments. arXiv preprint arXiv: 2104.11059, 2021.
[96] Liang J, Huang W, Xia F, Xu P, Hausman K and Ichter B et al. Code as policies: Language model programs for embodied control. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). 2023. 9493–9500
[97] Ahn M, Brohan A, Brown N, Chebotar Y, Cortes O and David B et al. Do As I Can, Not As I Say: Grounding language in robotic affordances. In: Proceedings of the 6th Conference on Robot Learning (CoRL). 2022. 150–161
[98] Yu Q J, Huang S Y, Yuan X B, Jiang Z K, Hao C and Li X et al. UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models. arXiv preprint arXiv: 2409.20551, 2024.
[99] Huang W, Wang C, Zhang R, Li Y, Wu J and Fei-Fei L. VoxPoser: Composable 3D value maps for robotic manipulation with language models. In: Proceedings of the 7th Conference on Robot Learning. PMLR, 2023. 540–562
[100] Huang W L, Wang C, Li Y Z, Zhang R H and Fei-Fei L. ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. In: Proceedings of the 2nd CoRL Workshop on Learning Effective Abstractions for Planning. 2024.
[101] Cai M X, Wang D L, Feng S and Zhang Y F. PECER: Empathetic response generation via dynamic personality extraction and contextual emotional reasoning. In: ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2024. 10631–10635
[102] Jin Q, Yang Y, Chen Q and Lu Z. GeneGPT: Augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics, 2024, 40(2): ii125−ii134
[103] Xiao S, Liu Z, Zhang P, Muennighoff N, Lian D and Nie J Y. C-Pack: Packed resources for general Chinese embeddings. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). New York, NY, USA: Association for Computing Machinery, 2024. 641–649
[104] Li Z C, Wang J H, Jiang Z S, Mao H Y, Chen Z X and Du J Z et al. DMQR-RAG: Diverse Multi-Query Rewriting for RAG. arXiv preprint arXiv: 2411.13154, 2024.
[105] Xu H S, Zhu S, Wang Z H, Zheng H, Ma D and Cao R S et al. Reducing Tool Hallucination via Reliability Alignment. arXiv preprint arXiv: 2412.04141, 2024.
[106] Mialon G, Dessì R, Lomeli M, Nalmpantis C, Pasunuru R and Raileanu R et al. Augmented Language Models: a Survey. arXiv preprint arXiv: 2302.07842, 2023.
[107] DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv: 2501.12948, 2025.
[108] Zeng Z Y, Cheng Q Y, Yin Z Y, Wang B, Li S M and Zhou Y H et al. Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective. arXiv preprint arXiv: 2412.14135, 2024.