摘要: 视觉−语言−动作(VLA)模型作为具身智能发展的核心方向, 旨在构建统一的多模态表示与感知–决策–执行一体化架构, 以突破传统模块化系统在功能割裂、语义对齐不足及泛化能力有限等瓶颈. 本文系统回顾前 VLA 时代的技术积淀, 梳理模块化、端到端与混合范式三类主流建模路径, 分析其结构特点、能力优势与面临的关键挑战. 在此基础上, 总结当前代表性 VLA 模型的体系结构、训练机制、多模态融合策略及应用成效, 并对典型数据集与评测基准进行分类比较. 最后, 结合跨模态协同、知识注入、长时序规划与真实环境泛化等方面, 展望未来 VLA 模型的发展趋势与研究方向.
关键词:
- 具身智能
- 视觉–语言–动作模型
- 多模态融合
- 端到端学习
- 任务泛化
Abstract: The Vision–Language–Action (VLA) model, as a core direction in the development of embodied intelligence, aims to construct a unified multimodal representation and an integrated perception–decision–execution architecture, in order to overcome the bottlenecks of traditional modular systems such as functional fragmentation, insufficient semantic alignment, and limited generalization capability. This paper systematically reviews the technical foundations laid in the pre-VLA era, categorizing and analyzing three mainstream modeling paradigms (modular, end-to-end, and hybrid) in terms of their structural characteristics, capabilities, and key challenges. Furthermore, it summarizes the architectures, training mechanisms, multimodal fusion strategies, and application outcomes of representative contemporary VLA models, while providing a categorized comparison of typical datasets and evaluation benchmarks. Finally, the paper outlines future trends and research directions for VLA models, focusing on cross-modal collaboration, knowledge injection, long-horizon planning, and generalization in real-world environments.
表 1 强化学习与模仿学习算法在具身智能中的应用特性对比
Table 1 Comparison of reinforcement and imitation learning algorithms in embodied intelligence
| 算法 | 类型 | 核心机制 | 优势 | 局限性 |
| --- | --- | --- | --- | --- |
| DQN[25] | 强化学习 | Q 值函数学习 + $ \varepsilon $-贪心策略 | 结构清晰, 适用于离散动作控制 | 难扩展至连续控制, 样本效率较低 |
| PPO[26] | 强化学习 | 策略梯度 + 概率剪切(Clipping) | 收敛稳定, 鲁棒性好, 应用广泛 | 超参数敏感, 收敛速度较慢 |
| SAC[27] | 强化学习 | 最大熵策略 + 双 Q 网络结构 | 支持连续控制, 收敛较快, 探索充分 | 结构复杂, 调参成本高 |
| TD3[32] | 强化学习 | 双 Q 网络 + 策略平滑 | 缓解过估计, 训练稳定性强 | 计算开销较大, 超参数较多 |
| A3C[33] | 强化学习 | 多线程异步 Actor-Critic | 高效并行, 能适应大规模任务 | 收敛不稳定, 梯度噪声较大 |
| BC[34] | 模仿学习 | 监督式模仿专家演示 | 简单高效, 适合小样本场景, 训练稳定 | 泛化较弱, 对专家数据质量依赖高 |
| DAgger[30] | 模仿学习 | 专家纠正 + 迭代数据聚合 | 缓解分布偏移, 性能稳健 | 需专家频繁参与, 代价较高 |
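为直观说明表 1 中两类学习范式核心机制的差异, 下面给出一个基于 NumPy 的表格化玩具示例(状态与动作均为离散的假设设定, 仅示意更新规则, 不是任何文献算法的完整实现): 强化学习部分对应 DQN 的 TD 目标与 $ \varepsilon $-贪心探索, 模仿学习部分对应行为克隆把专家演示当作监督数据。

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# --- 强化学习(DQN 的表格化简化): TD 目标 + ε-贪心探索 ---
Q = np.zeros((n_states, n_actions))
gamma, alpha, epsilon = 0.9, 0.1, 0.2

def epsilon_greedy(state):
    """以 ε 的概率随机探索, 否则选取当前 Q 值最大的动作."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def td_update(s, a, r, s_next):
    """Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]"""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# --- 模仿学习(行为克隆的表格化简化): 把专家演示当作监督数据 ---
expert_states = rng.integers(n_states, size=100)
expert_actions = expert_states % n_actions        # 假设的专家策略
counts = np.zeros((n_states, n_actions))
for s, a in zip(expert_states, expert_actions):
    counts[s, a] += 1                             # 统计专家在各状态下的动作分布
bc_policy = counts.argmax(axis=1)                 # 每个状态取专家动作的众数作为克隆策略
```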
表 2 模块化、端到端与混合式 VLA 模型结构对比分析
Table 2 Comparative analysis of modular, end-to-end and hybrid VLA architectures
| 模型类型 | 系统结构特点 | 训练方式 | 优势 | 局限性 |
| --- | --- | --- | --- | --- |
| 模块化结构 | 多子模块解耦, 分阶段执行 | 分模块独立训练 | 可解释性强, 结构清晰, 支持局部优化与升级 | 缺乏联合优化, 信息传递割裂, 泛化能力不足 |
| 端到端结构 | 所有输入统一 Token 表示, 由 Transformer 整体建模 | 端到端整体训练 | 表达能力强, 适应性好, 任务复用性高 | 可解释性弱, 调试困难, 对数据和算力依赖大 |
| 混合结构 | 通过 Prompt 驱动整体流程, 在关键位置保留接口与指令控制 | 灵活训练方式, 可结合端到端与模块化策略 | 模型灵活, 可同时支持多任务与 Prompt 控制 | Prompt 设计复杂, 依赖度高, 调试与维护成本大 |
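为便于理解表 2 中"多子模块解耦"与"统一整体建模"在接口上的差异, 下面给出一个仅示意控制流的简化骨架(其中的函数、类型与返回值均为假设, 不对应任何具体系统): 模块化结构中信息必须经过固定的中间表示逐级传递, 端到端结构则由单一策略直接完成从原始输入到动作的映射。

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    image: bytes        # 相机观测
    instruction: str    # 自然语言指令

# --- 模块化结构: 感知、规划、控制各自独立, 通过显式中间表示衔接 ---
def perceive(obs: Observation) -> List[str]:
    return ["cup on table"]                 # 假设的感知输出(符号化场景描述)

def plan(scene: List[str], instruction: str) -> List[str]:
    return ["move_to(cup)", "grasp(cup)"]   # 假设的子任务序列

def control(skill: str) -> List[float]:
    return [0.0] * 7                        # 假设的 7 维关节/末端动作

def modular_policy(obs: Observation) -> List[List[float]]:
    scene = perceive(obs)                   # 信息沿固定接口逐级传递, 各模块可独立训练与替换
    return [control(skill) for skill in plan(scene, obs.instruction)]

# --- 端到端结构: 单一模型直接从原始输入映射到动作 ---
class EndToEndPolicy:
    def __call__(self, obs: Observation) -> List[float]:
        # 实际系统中为一个联合训练的 Transformer; 此处仅示意统一的输入输出接口
        return [0.0] * 7
```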
表 3 典型 VLA 模型的能力与挑战
Table 3 Capabilities and challenges of representative VLA models across modeling paradigms
| 建模范式 | 模型 | 核心能力 | 多模态结构 | 技术特点 | 主要挑战 |
| --- | --- | --- | --- | --- | --- |
| 模块化 | SayCan[9] | 可行性驱动的语义规划与执行选择 | LLM + 感知/技能库 | 模块边界清晰, 语义链条透明 | 缺少端到端联合优化, 模块耦合度高, 跨任务泛化弱 |
| 模块化 | PerAct[39] | 细粒度操作动作生成 | 点云 + 语言 + 低级动作 | 控制粒度细, 低层执行能力强 | 任务级语义理解不足, 全局策略规划能力弱 |
| 端到端 | Gato[36] | 多任务统一策略生成 | 图像/文本/动作统一 Token | Token 级建模, 支持跨任务迁移与共享 | 可解释性弱, 数据需求大, 任务相互干扰 |
| 端到端 | RT-2[38] | 从网页知识迁移到机器人执行 | 文本 + 图像 + 动作 | 语言–视觉迁移强, 能支持现实任务执行 | 控制链复杂, 语料依赖重, 开放场景泛化待验证 |
| 端到端 | PaLM-E[35] | 通用多模态机器人控制 | 传感器/语言/图像/动作融合 | 端到端映射复杂输入到动作 | 参数规模大, 算力与部署门槛高 |
| 混合范式 | VoxPoser[40] | 空间约束下的任务分解与动作规划 | 语言 + 图像 + 3D 空间表示 | LLM 生成约束图, 规划可解释 | 依赖精确空间建模, 复杂场景鲁棒性不足 |
| 混合范式 | 3D-VLA[41] | 三维环境感知与跨模态融合 | 3D 视觉 + 语言 + 动作 | 3D 特征编码结合语言引导, 空间理解更强 | 3D 数据获取/计算代价高, 泛化受限 |
| 混合范式 | Inner Monologue[42] | 语言驱动的显式自我规划与修正 | 图像 + 语言 + 动作(语言推理为中间层) | 通过内部语言推理提升透明度与可介入性 | 语言规划的一致性与稳定性有待提升 |
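表 3 中端到端范式的一个关键技术点, 是把连续动作离散化为 token, 与图像、文本 token 拼成统一序列后由 Transformer 自回归建模(例如 RT-2 将每个动作维度离散化为 256 个分箱; 具体分箱数与归一化方式因模型而异)。下面用 NumPy 给出动作离散化与还原的最小示意, 其中的分箱数、归一化区间和 token 取值均为假设性示例:

```python
import numpy as np

NUM_BINS = 256          # 常见设定: 每个动作维度离散化为 256 个分箱
LOW, HIGH = -1.0, 1.0   # 假设动作已归一化到 [-1, 1]

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """把连续动作向量逐维映射为离散 token id (0 ~ NUM_BINS-1)."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(int)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """推理时把模型输出的动作 token 还原为连续控制量."""
    return tokens / (NUM_BINS - 1) * (HIGH - LOW) + LOW

# 训练序列示意: [图像 token] + [语言 token] + [动作 token] 拼接后统一交给 Transformer 建模
image_tokens = np.arange(4)                         # 假设的图像 patch token id
text_tokens = np.array([101, 2009, 102])            # 假设的指令 token id
action_tokens = action_to_tokens(np.array([0.1, -0.5, 0.8, 0.0, 0.0, 0.0, 1.0]))
sequence = np.concatenate([image_tokens, text_tokens, action_tokens])
```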
表 4 真实环境数据集以及仿真环境数据集
Table 4 Real-world environment datasets and simulated environment datasets
| 名称 | 收集方式 | 指令形式 | 具身平台(仿真平台) | 场景数 | 任务数 | 片段数 |
| --- | --- | --- | --- | --- | --- | --- |
| MIME[107] | 人工操作 | 演示 | Baxter 机器人 | 1 | 20 | 8.3 K |
| RoboNet[108] | 脚本预设 | 目标图像 | 7 种机器人 | 10 | – | 162 K |
| MT-Opt[109] | 脚本预设 | 自然语言 | 7 种机器人 | 1 | 12 | 800 K |
| BC-Z[77] | 人工操作 | 自然语言/演示 | Everyday 机器人 | 1 | 100 | 25.9 K |
| RT-1_Kitchen[10] | 人工操作 | 自然语言 | Everyday 机器人 | 2 | 700+ | 130 K |
| RoboSet[110] | 人工/脚本 | 自然语言 | Franka Panda 机械臂 | 11 | 38 | 98.5 K |
| BridgeData[111] | 人工/脚本 | 自然语言 | WidowX 250 机械臂 | 24 | – | 60.1 K |
| RH20T[112] | 人工操作 | 自然语言 | 4 种机器人 | 7 | 147 | 110 K+ |
| DROID[113] | 人工操作 | 自然语言 | Franka Panda 机械臂 | 564 | – | 76 K |
| OXE[114] | 聚合数据集 | 自然语言 | 22 种机器人 | 311 | 160000+ | 1M+ |
| Static ALOHA[122] | 人工操作 | 演示 | ALOHA 机器人 | 1 | 10+ | 825 |
| VIMA-Data[37]* | 脚本预设 | 自然语言/图像 | PyBullet | 1 | 13 | 650 K |
| SynGrasp-1B[116]* | 仿真模拟 | 视觉/动作轨迹 | BoDex/CuRobo | – | 1 | – |
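表 4 中的数据集虽然平台与规模差异很大, 但单条"片段"(episode)通常都可以抽象为"语言指令(或目标图像/演示)+ 逐步观测 + 逐步动作"的结构。下面用 Python dataclass 给出一种示意性的组织方式(字段名均为假设, 不对应任何数据集的真实 schema, 仅用于说明数据组织逻辑):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    image: bytes                # 当前帧相机观测(编码后的图像)
    proprio: List[float]        # 本体状态, 如关节角或末端位姿
    action: List[float]         # 该步执行的动作, 维度依机器人而定
    is_terminal: bool = False   # 是否为片段的最后一步

@dataclass
class Episode:
    instruction: str            # 自然语言指令; 部分数据集为目标图像或人工演示
    robot: str                  # 具身平台名称, 如 "Franka Panda"
    steps: List[Step] = field(default_factory=list)

# 用法示意: 构造一条抓放任务片段
episode = Episode(instruction="把杯子放到架子上", robot="Franka Panda")
episode.steps.append(Step(image=b"", proprio=[0.0] * 7, action=[0.0] * 7, is_terminal=True))
```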
表 5 四种评估方法对比
Table 5 Comparison of four evaluation methods
| 评估方法 | 简述 | 优点 | 局限 | 代表性工作 |
| --- | --- | --- | --- | --- |
| 分阶段评估 | “仿真大筛选 + 真机小样本核验”的闭环; 真实 $ \rightarrow $ 仿真 $ \rightarrow $ 真实迭代收敛 | 低边际成本, 可控性强, 可信度较强 | 仿真-现实的差距可能累积; 仿真建模、对齐成本高 | RialTo[132]; VR-Robo[138]; DREAM[139] |
| 域自适应评估 | 显式划分源、目标域, 在给定适配预算下度量域迁移的性能 | 公平比较“适配速度与收益”; 贴近实际应用 | 需统一协议与预算; 不同任务的跨域差异难完全对齐 | Meta-RL-sim2real[140]; ADR[141]; BDA[142] |
| 虚实协同评估 | 以与现实双向同步的虚拟环境做对照评测与故障注入 | 高度保真, 与现实一致性强; 可控且安全 | 构建、维护虚拟环境的传感与建模成本高; 存在同步性、误差累积等问题 | RoboTwin[143]; Real-is-Sim[144]; DT synchronization[145] |
| 在线自适应评估 | “测试即适配”, 在执行过程中实时监测与再适配 | 直接衡量反应速度、通讯与计算开销等指标 | 安全风险较高; 实现极为复杂 | MonTA[146]; RMA[147]; A-RMA[148] |
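表 5 中"分阶段评估"的核心思想是先用低成本仿真做大规模筛选, 再以小样本真机试验核验排名靠前的候选策略。下面给出该流程的一个示意性骨架, 其中 evaluate_in_sim、evaluate_on_robot 等函数均为假设接口, 仅用于说明筛选逻辑:

```python
from typing import Callable, Dict, List

def staged_evaluation(
    policies: List[str],
    evaluate_in_sim: Callable[[str, int], float],    # 假设接口: 返回仿真成功率
    evaluate_on_robot: Callable[[str, int], float],  # 假设接口: 返回真机成功率
    sim_episodes: int = 500,
    real_episodes: int = 20,
    top_k: int = 3,
) -> Dict[str, float]:
    """先在仿真中大规模筛选, 再对排名前 top_k 的候选策略做小样本真机核验."""
    sim_scores = {p: evaluate_in_sim(p, sim_episodes) for p in policies}
    shortlisted = sorted(sim_scores, key=sim_scores.get, reverse=True)[:top_k]
    return {p: evaluate_on_robot(p, real_episodes) for p in shortlisted}
```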
表 6 基准测试常用评估指标名称、解释及其应用任务
Table 6 Names, definitions and application tasks of common benchmarking metrics
| 名称 | 解释 | 适用任务 |
| --- | --- | --- |
| 成功率 | 成功完成任务比例 | 所有基准测试通用 |
| 路径效率加权成功率 | 考虑路径效率的成功率 | 导航任务、操作任务等 |
| 样本效率 | 达到目标性能所需的训练样本数 | 强化学习、模仿学习等 |
| 泛化得分 | 在未见过的环境/对象上的性能表现 | 跨任务、跨场景任务等 |
| 导航误差 | 终点与目标位置之间的距离误差 | 导航类任务 |
| 任务完成率 | 完整任务序列的完成比例 | 多步骤、长序列任务评估 |
| 前向迁移 | 学习新任务时利用已有知识的能力 | 多任务学习、终身学习等 |
| 后向迁移 | 学习新任务后对旧任务性能的影响 | 多任务学习、终身学习等 |
| 回合奖励 | 单轮任务的累计奖励值 | 强化学习算法 |
| 鲁棒性得分 | 面对噪声、扰动时的稳定性 | 算法可靠性评估 |
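表 6 中的“路径效率加权成功率”在视觉语言导航等任务中通常采用 SPL(Success weighted by Path Length)这一定义, 其常见形式如下(符号为通用记法, 各基准的实现细节可能略有差异):

$ \mathrm{SPL} = \dfrac{1}{N}\sum_{i=1}^{N} S_i\,\dfrac{l_i}{\max(p_i,\; l_i)} $

其中 $ N $ 为评测回合数, $ S_i \in \{0, 1\} $ 表示第 $ i $ 个回合是否成功, $ l_i $ 为起点到目标的最短路径长度, $ p_i $ 为智能体实际走过的路径长度; 智能体沿最短路径成功到达时该项为 1, 绕路越多得分越低。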
表 7 常用基准测试
Table 7 Common benchmark tests
| 名称 | 指令形式 | 任务描述 | 评估指标 |
| --- | --- | --- | --- |
| CALVIN[155] | 自然语言 | 长序列、多步骤操作 | 连续任务成功率、语言理解得分、长期规划 |
| RLBench[152] | 视觉引导 | 视觉操控 | 成功率、样本效率 |
| VLN-CE[151] | 自然语言 | 3D 环境语言导航 | 成功率、路径效率加权成功率、导航误差 |
| LIBERO[156] | 自然语言 | 终身学习 | 前向迁移、后向迁移、任务间相互干扰水平 |
| Meta-World[153] | 目标向量 | 多任务元学习 | 平均成功率、前向学习、后向学习 |
| Franka Kitchen[154] | 目标向量 | 长时程厨房操控 | 任务完成率、子任务完成率、鲁棒性得分 |
| DeepMind Control[157] | 连续控制 | 连续控制任务 | 回合奖励、样本效率 |
[1] Lake B M, Ullman T D, Tenenbaum J B, et al. Building Machines That Learn and Think Like People. Behavioral and Brain Sciences, 2017, 40: e253 doi: 10.1017/S0140525X16001837
[2] Brooks R A. Intelligence without representation. Artificial Intelligence, 1991, 47(1-3): 139−159 doi: 10.1016/0004-3702(91)90053-M
[3] Shridhar M, Thomason J, Gordon D, et al. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 10740–10749.
[4] Anderson P, Wu Q, Teney D, et al. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3674–3683.
[5] Artzi Y, Zettlemoyer L. Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions. Transactions of the Association for Computational Linguistics, 2013, 1: 49−62 doi: 10.1162/tacl_a_00209
[6] Radford A, Kim J W, Hallacy C, et al. Learning Transferable Visual Models From Natural Language Supervision[C]. International Conference on Machine Learning. PMLR, 2021: 8748–8763.
[7] Brown T, Mann B, Ryder N, et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 2020, 33: 1877−1901
[8] Shridhar M, Manuelli L, Fox D. CLIPORT: What and Where Pathways for Robotic Manipulation[C]. Conference on Robot Learning. PMLR, 2022: 894–906.
[9] Ahn M, Brohan A, Brown N, et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances[C]. Conference on Robot Learning. PMLR, 2023: 287−318.
[10] Brohan A, Brown N, Carbajal J, et al. Rt-1: Robotics transformer for real-world control at scale[J]. arXiv preprint arXiv: 2212.06817, 2022.
[11] Chen C, Wu Y F, Yoon J, et al. Transdreamer: Reinforcement learning with transformer world models[J]. arXiv preprint arXiv: 2202.09481, 2022.
[12] Ren L, Dong J, Liu S, et al. Embodied intelligence toward future smart manufacturing in the era of AI foundation model[J]. IEEE/ASME Transactions on Mechatronics, 2024.
[13] Ma Y, Song Z, Zhuang Y, et al. A survey on vision-language-action models for embodied AI[J]. Computing Research Repository, abs/2405.14093.
[14] Zhong Y, Bai F, Cai S, et al. A Survey on Vision-Language-Action Models: An Action Tokenization Perspective[J]. arXiv preprint arXiv: 2507.01925, 2025.
[15] Din M U, Akram W, Saoud L S, et al. Vision language action models in robotic manipulation: A systematic review[J]. arXiv preprint arXiv: 2507.10672, 2025.
[16] Sapkota R, Cao Y, Roumeliotis K I, et al. Vision-language-action models: Concepts, progress, applications and challenges[J]. arXiv preprint arXiv: 2505.04769, 2025.
[17] Wang F Y. Parallel system methods for management and control of complex systems. Control and Decision, 2004, 19: 485−489
[18] 杨静, 王晓, 王雨桐, 刘忠民, 李小双, 王飞跃. 平行智能与CPSS: 三十年发展的回顾与展望. 自动化学报, 2023, 49(3): 614−634 doi: 10.16383/j.aas.c230015 (Yang J, Wang X, Wang Y-T, et al. Parallel intelligence and CPSS in 30 years: An ACP approach. Acta Automatica Sinica, 2023, 49(3): 614−634)
[19] Wang X, Yang J, Liu Y, et al. Parallel intelligence in three decades: A historical review and future perspective on ACP and cyber-physical-social systems. Artificial Intelligence Review, 2024, 57(9): 255 doi: 10.1007/s10462-024-10861-9
[20] 李柏, 郝金第, 孙跃硕, 等. 平行智能范式视角下的视觉-语言-动作模型发展现状与展望. 智能科学与技术学报, 2025, 7(3): 290−306 (Li B, Hao J, Sun Y, et al. Vision-Language-Action Models under ACP Paradigm: The State of the Art and Future Perspectives. Chinese Journal of Intelligent Science and Technology, 2025, 7(3): 290−306)
[21] Villaroman N, Rowe D, Swan B. Teaching natural user interaction using OpenNI and the Microsoft Kinect sensor[C]. PMLR, 2011: 227-232.
[22] Chitta S. MoveIt!: an introduction[M]. Robot Operating System (ROS) The Complete Reference (Volume 1). Cham: Springer International Publishing, 2016: 3-27.
[23] Tellex S, Knepper R, Li A, et al. Asking for help using inverse semantics[J]. 2014.
[24] Colledanchise M, Ögren P. Behavior Trees in Robotics and AI: An Introduction[J]. arXiv e-prints, 2017: arXiv: 1709.00084.
[25] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529−533 doi: 10.1038/nature14236
[26] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms[J]. arXiv preprint arXiv: 1707.06347, 2017.
[27] Haarnoja T, Zhou A, Abbeel P, et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor[C]. International Conference on Machine Learning. PMLR, 2018: 1861-1870.
[28] Hafner D, Pasukonis J, Ba J, et al. Mastering diverse domains through world models[J]. arXiv preprint arXiv: 2301.04104, 2023.
[29] Hafner D, Lillicrap T, Fischer I, et al. Learning latent dynamics for planning from pixels[C]. International Conference on Machine Learning. PMLR, 2019: 2555-2565.
[30] Ross S, Gordon G J, Bagnell J A. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning[J]. arXiv preprint arXiv: 1011.0686, 2010.
[31] Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 2022, 35: 27730−27744
[32] Fujimoto S, Hoof H, Meger D. Addressing function approximation error in actor-critic methods[C]. International Conference on Machine Learning. PMLR, 2018: 1587-1596.
[33] Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[C]. International Conference on Machine Learning. PMLR, 2016: 1928-1937.
[34] Pomerleau D A. ALVINN: an autonomous land vehicle in a neural network[C]. Proceedings of the 2nd International Conference on Neural Information Processing Systems. 1988: 305-313.
[35] Driess D, Xia F, Sajjadi M S M, et al. PaLM-E: an embodied multimodal language model[C]. Proceedings of the 40th International Conference on Machine Learning. 2023: 8469-8488.
[36] Reed S, Zolna K, Parisotto E, et al. A generalist agent[J]. arXiv preprint arXiv: 2205.06175, 2022.
[37] Jiang Y, Gupta A, Zhang Z, et al. Vima: General robot manipulation with multimodal prompts[J]. arXiv preprint arXiv: 2210.03094, 2022.
[38] Zitkovich B, Yu T, Xu S, et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control[C]. Conference on Robot Learning. PMLR, 2023: 2165-2183.
[39] Shridhar M, Manuelli L, Fox D. PERCEIVER-ACTOR: A Multi-Task Transformer for Robotic Manipulation[C]. Conference on Robot Learning. PMLR, 2023: 785-799.
[40] Huang W, Wang C, Zhang R, et al. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models[C]. Conference on Robot Learning. PMLR, 2023: 540-562.
[41] Zhen H, Qiu X, Chen P, et al. 3D-VLA: a 3D vision-language-action generative world model[C]. Proceedings of the 41st International Conference on Machine Learning. 2024: 61229-61245.
[42] Huang W, Xia F, Xiao T, et al. Inner Monologue: Embodied Reasoning through Planning with Language Models[C]. Conference on Robot Learning. PMLR, 2023: 1769-1782.
[43] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[C]. International Conference on Learning Representations. 2020.
[44] Huang S, Chang H, Liu Y, et al. A3VLM: Actionable Articulation-Aware Vision Language Model[C]. Conference on Robot Learning. PMLR, 2025: 1675-1690.
[45] Oquab M, Darcet T, Moutakanni T, et al. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research, 2024: 1−31
[46] Gbagbe K F, Cabrera M A, Alabbas A, et al. Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations[C]. IEEE International Conference on Systems, Man, and Cybernetics. 2024: 2864-2869.
[47] Bai J, Bai S, Yang S, et al. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond[J]. arXiv preprint, 2023.
[48] Nair S, Rajeswaran A, Kumar V, et al. R3M: A Universal Visual Representation for Robot Manipulation[C]. Conference on Robot Learning. PMLR, 2023: 892-909.
[49] Li C, Wen J, Peng Y, et al. PointVLA: Injecting the 3D World into Vision-Language-Action Models[J]. arXiv preprint arXiv: 2503.07511, 2025.
[50] Sun L, Xie B, Liu Y, et al. GeoVLA: Empowering 3D Representations in Vision-Language-Action Models[J]. arXiv preprint arXiv: 2508.09071, 2025.
[51] Tang W, Pan J H, Liu Y H, et al. GeoManip: Geometric Constraints as General Interfaces for Robot Manipulation[J]. arXiv preprint arXiv: 2501.09783, 2025.
[52] Zhou Z, Zhu Y, Zhu M, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model[J]. arXiv preprint arXiv: 2502.14420, 2025.
[53] Hancock A J, Ren A Z, Majumdar A. Run-time observation interventions make vision-language-action models more visually robust[C]. IEEE International Conference on Robotics and Automation, 2025: 9499-9506.
[54] Kirillov A, Mintun E, Ravi N, et al. Segment Anything[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 4015-4026.
[55] Minderer M, Gritsenko A, Stone A, et al. Simple open-vocabulary object detection[C]. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 728-755.
[56] Yang J, Tan R, Wu Q, et al. Magma: A Foundation Model for Multimodal AI Agents[C]. Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 14203-14214.
[57] Zhao W, Ding P, Zhang M, et al. Vlas: Vision-language-action model with speech instructions for customized robot manipulation[J]. arXiv preprint arXiv: 2502.13508, 2025.
[58] Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and Efficient Foundation Language Models[J]. arXiv preprint arXiv: 2302.13971, 2023.
[59] Jones J, Mees O, Sferrazza C, et al. Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding[J]. arXiv preprint arXiv: 2501.04693, 2025.
[60] Samson M, Muraccioli B, Kanehiro F. Scalable, Training-Free Visual Language Robotics: A Modular Multi-Model Framework for Consumer-Grade GPUs[C]. IEEE/SICE International Symposium on System Integration. IEEE, 2025: 193-198.
[61] Khan M H, Asfaw S, Iarchuk D, et al. Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing[C]. ACM/IEEE International Conference on Human-Robot Interaction. IEEE, 2025: 1393-1397.
[62] Li J, Li D, Savarese S, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models[C]. International Conference on Machine Learning. PMLR, 2023: 19730-19742.
[63] Wen J, Zhu Y, Li J, et al. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control[J]. arXiv preprint arXiv: 2502.05855, 2025.
[64] Zhang H, Li X, Bing L. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding[C]. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2023: 543-553.
[65] Cheng A C, Ji Y, Yang Z, et al. NaVILA: Legged Robot Vision-Language-Action Model for Navigation[J]. arXiv preprint arXiv: 2412.04453, 2024.
[66] Xu Z, Chiang H T L, Fu Z, et al. Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs[C]. Conference on Robot Learning. 2024.
[67] Shi L X, Ichter B, Equi M, et al. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models[J]. arXiv preprint arXiv: 2502.19417, 2025.
[68] Zollo T P, Zemel R. Confidence Calibration in Vision-Language-Action Models[J]. arXiv preprint arXiv: 2507.17383, 2025.
[69] Wu Y, Tian R, Swamy G, et al. From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment[J]. arXiv preprint arXiv: 2502.01828, 2025.
[70] Ji Y, Tan H, Shi J, et al. RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete[C]. Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 1724-1734.
[71] Wu Z, Zhou Y, Xu X, et al. MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation[C]. Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 1714-1723.
[72] Li Y, Deng Y, Zhang J, et al. HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation[J]. arXiv preprint arXiv: 2502.05485, 2025.
[73] Huang C P, Wu Y H, Chen M H, et al. ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning[J]. arXiv preprint arXiv: 2507.16815, 2025.
[74] Qi Z, Zhang W, Ding Y, et al. SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation[J]. arXiv preprint arXiv: 2502.13143, 2025.
[75] Li Y, Yan G, Macaluso A, et al. Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation[J]. arXiv preprint arXiv: 2501.18733, 2025.
[76] Bi J, Ma K Y, Hao C, et al. VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback[J]. arXiv preprint arXiv: 2507.17294, 2025.
[77] Jang E, Irpan A, Khansari M, et al. BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning[C]. Conference on Robot Learning. PMLR, 2022: 991–1002.
[78] Ghosh D, Walke H R, Pertsch K, et al. Octo: An Open-Source Generalist Robot Policy[C]. Robotics: Science and Systems. 2024.
[79] Gu J, Kirmani S, Wohlhart P, et al. Robotic Task Generalization via Hindsight Trajectory Sketches[C]. First Workshop on Out-of-Distribution Generalization in Robotics at CoRL 2023. 2023.
[80] Zhao T, Kumar V, Levine S, et al. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware[J]. Robotics: Science and Systems XIX, 2023.
[81] Ma Y, Chi D, Wu S, et al. Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning[J]. Computing Research Repository, 2024.
[82] Liu J, Liu M, Wang Z, et al. RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation. Advances in Neural Information Processing Systems, 2024, 37: 40085−40110
[83] Chi C, Xu Z, Feng S, et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. The International Journal of Robotics Research, 2023: 02783649241273668
[84] Liu S, Wu L, Li B, et al. RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation[C]. International Conference on Learning Representations, 2025.
[85] Hou Z, Zhang T, Xiong Y, et al. Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy[J]. Computing Research Repository, 2025.
[86] Hou Z, Zhang T, Xiong Y, et al. Diffusion Transformer Policy[J]. Computing Research Repository, 2024.
[87] Touvron H, Martin L, Stone K, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models[J]. arXiv preprint arXiv: 2307.09288, 2023.
[88] Chiang W L, Li Z, Lin Z, et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality[J]. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
[89] Zhang J, Wang K, Wang S, et al. Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks[J]. arXiv preprint arXiv: 2412.06224, 2024.
[90] Fu H, Zhang D, Zhao Z, et al. ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation[J]. arXiv preprint arXiv: 2503.19755, 2025.
[91] Song W, Chen J, Ding P, et al. CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding[J]. arXiv preprint arXiv: 2506.13725, 2025.
[92] Zhai X, Mustafa B, Kolesnikov A, et al. Sigmoid loss for language image pre-training[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 11975-11986.
[93] Li S, Wang J, Dai R, et al. RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model[J]. arXiv preprint arXiv: 2409.19590, 2024.
[94] Ding P, Zhao H, Zhang W, et al. QUAR-VLA: Vision-Language-Action Model for Quadruped Robots[C]. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 352-367.
[95] Dey S, Zaech J N, Nikolov N, et al. Revla: Reverting visual domain limitation of robotic foundation models[C]. IEEE International Conference on Robotics and Automation. IEEE, 2025: 8679-8686.
[96] Chen P, Bu P, Wang Y, et al. CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games[J]. arXiv preprint arXiv: 2503.09527, 2025.
[97] Kim M J, Pertsch K, Karamcheti S, et al. OpenVLA: An Open-Source Vision-Language-Action Model[C]. Conference on Robot Learning. PMLR, 2025: 2679-2713.
[98] Budzianowski P, Maa W, Freed M, et al. EdgeVLA: Efficient Vision-Language-Action Models[J]. arXiv preprint arXiv: 2507.14049, 2025.
[99] Arai H, Miwa K, Sasaki K, et al. CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving[C]. IEEE/CVF Winter Conference on Applications of Computer Vision, 2025: 1933-1943.
[100] Yang Z, Li L, Lin K, et al. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)[J]. arXiv preprint arXiv: 2309.17421, 2023.
[101] Zhang J, Guo Y, Chen X, et al. HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers[C]. Conference on Robot Learning. PMLR, 2025: 933-946.
[102] Zhao H, Song W, Wang D, et al. MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models[J]. Computing Research Repository, 2025.
[103] Shukor M, Aubakirova D, Capuano F, et al. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics[J]. arXiv preprint arXiv: 2506.01844, 2025.
[104] Huang S, Chen L, Zhou P, et al. EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation[J]. Computing Research Repository, 2025.
[105] Zhao Q, Lu Y, Kim M J, et al. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models[C]. Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 1702-1713.
[106] Zhang W, Liu H, Qi Z, et al. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge[J]. arXiv preprint arXiv: 2507.04447, 2025.
[107] Sharma P, Mohan L, Pinto L, et al. Multiple interactions made easy (MIME): Large scale demonstrations data for imitation[C]. Conference on Robot Learning. PMLR, 2018: 906–915.
[108] Dasari S, Ebert F, Tian S, et al. RoboNet: Large-Scale Multi-Robot Learning[C]. Conference on Robot Learning. PMLR, 2020: 885-897.
[109] Kalashnikov D, Varley J, Chebotar Y, et al. Mt-opt: Continuous multi-task robotic reinforcement learning at scale[J]. arXiv preprint arXiv: 2104.08212, 2021.
[110] Kumar V, Shah R, Zhou G, et al. Robohive: A unified framework for robot learning. Advances in Neural Information Processing Systems, 2023, 36: 44323−44340
[111] Walke H R, Black K, Zhao T Z, et al. Bridgedata v2: A dataset for robot learning at scale[C]. Conference on Robot Learning. PMLR, 2023: 1723–1736.
[112] Fang H S, Fang H, Tang Z, et al. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot[C]. IEEE International Conference on Robotics and Automation. IEEE, 2024: 653–660.
[113] Khazatsky A, Pertsch K, Nair S, et al. DROID: A large-scale in-the-wild robot manipulation dataset[C]. Robotics: Science and Systems. 2024.
[114] O’Neill A, Rehman A, Maddukuri A, et al. Open X-Embodiment: Robotic Learning Datasets and RT-X Models[C]. IEEE International Conference on Robotics and Automation. IEEE, 2024: 6892–6903.
[115] Nasiriany S, Maddukuri A, Zhang L, et al. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots[C]. RSS, 2024.
[116] Deng S, Yan M, Wei S, et al. Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data[J]. arXiv preprint arXiv: 2505.03233, 2025.
[117] Liu W, Wan Y, Wang J, et al. FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real[J]. arXiv preprint arXiv: 2502.17894, 2025.
[118] Chen T, Chen Z, Chen B, et al. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation[J]. arXiv preprint arXiv: 2506.18088, 2025.
[119] Jiang Z, Xie Y, Lin K, et al. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning[C]. 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025: 16923-16930.
[120] Xiao T, Chan H, Sermanet P, et al. Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models[C]. Workshop on Language and Robotics at CoRL 2022.
[121] Ahn M, Dwibedi D, Finn C, et al. AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents[C]. ICRA, 2024.
[122] Fu Z, Zhao T Z, Finn C. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation[J]. arXiv preprint arXiv: 2401.02117, 2024.
[123] Wu H, Jing Y, Cheang C, et al. Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation[C]. International Conference on Learning Representations, ICLR. 2024.
[124] Ye S, Jang J, Jeon B, et al. Latent Action Pretraining From Videos[C]. CoRL, 2024.
[125] Bu Q, Cai J, Chen L, et al. AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems[J]. Computing Research Repository, 2025.
[126] Yang R, Yu Q, Wu Y, et al. Egovla: Learning vision-language-action models from egocentric human videos[J]. arXiv preprint arXiv: 2507.12440, 2025.
[127] Luo H, Feng Y, Zhang W, et al. Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos[J]. arXiv preprint arXiv: 2507.15597, 2025.
[128] Lin F, Nai R, Hu Y, et al. OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning[J]. arXiv preprint arXiv: 2505.11917, 2025.
[129] Pertsch K, Stachowicz K, Ichter B, et al. Fast: Efficient action tokenization for vision-language-action models[J]. arXiv preprint arXiv: 2501.09747, 2025.
[130] Duan Z, Zhang Y, Geng S, et al. Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse[J]. arXiv preprint arXiv: 2506.07639, 2025.
[131] Kalashnikov D, Irpan A, Pastor P, et al. Scalable deep reinforcement learning for vision-based robotic manipulation[C]. Conference on Robot Learning. PMLR, 2018: 651-673.
[132] Torne M, Simeonov A, Li Z, et al. Reconciling Reality through Simulation: A Real-to-Sim-to-Real Approach for Robust Manipulation[J]. Computing Research Repository, 2024.
[133] Ma Y J, Liang W, Wang G, et al. Eureka: Human-level reward design via coding large language models[J]. arXiv preprint arXiv: 2310.12931, 2023.
[134] Bjorck J, Castañeda F, Cherniadev N, et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots[J]. Computing Research Repository, 2025.
[135] Li M, Zhao S, Wang Q, et al. Embodied agent interface: Benchmarking llms for embodied decision making. Advances in Neural Information Processing Systems, 2024, 37: 100428−100534
[136] Yang R, Chen H, Zhang J, et al. EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents[J]. Computing Research Repository, 2025.
[137] Li D, Cai T, Tang T, et al. EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments[J]. arXiv preprint arXiv: 2503.08604, 2025.
[138] Zhu S, Mou L, Li D, et al. Vr-robo: A real-to-sim-to-real framework for visual robot navigation and locomotion[J]. IEEE Robotics and Automation Letters, 2025.
[139] Lou H, Zhang M, Geng H, et al. DREAM: Differentiable Real-to-Sim-to-Real Engine for Learning Robotic Manipulation[C]. 3rd RSS, 2025.
[140] Arndt K, Hazara M, Ghadirzadeh A, et al. Meta reinforcement learning for sim-to-real domain adaptation[C]. IEEE International Conference on Robotics and Automation. IEEE, 2020: 2725-2731.
[141] Mehta B, Diaz M, Golemo F, et al. Active domain randomization[C]. Conference on Robot Learning. PMLR, 2020: 1162-1176.
[142] Truong J, Chernova S, Batra D. Bi-directional domain adaptation for sim2real transfer of embodied navigation agents. IEEE Robotics and Automation Letters, 2021, 6(2): 2634−2641 doi: 10.1109/LRA.2021.3062303
[143] Mu Y, Chen T, Peng S, et al. Robotwin: Dual-arm robot benchmark with generative digital twins (early version)[C]. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 264-273.
[144] Abou-Chakra J, Sun L, Rana K, et al. Real-is-Sim: Bridging the Sim-to-Real Gap with a Dynamic Digital Twin for Real-World Robot Policy Evaluation[J]. arXiv preprint arXiv: 2504.03597, 2025.
[145] Cakir L V, Al-Shareeda S, Oktug S F, et al. How to synchronize digital twins? A communication performance analysis[C]. IEEE 28th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks, 2023: 123-127.
[146] Liu S, Zhang B, Huang Z. Benchmark real-time adaptation and communication capabilities of embodied agent in collaborative scenarios[J]. arXiv preprint arXiv: 2412.00435, 2024.
[147] Kumar A, Fu Z, Pathak D, et al. RMA: Rapid Motor Adaptation for Legged Robots[J]. Robotics: Science and Systems XVII, 2021.
[148] Kumar A, Li Z, Zeng J, et al. Adapting rapid motor adaptation for bipedal robots[C]. IEEE/RSJ International Conference on Intelligent Robots and Systems, 2022: 1161-1168.
[149] Atreya P, Pertsch K, Lee T, et al. RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies[J]. arXiv preprint arXiv: 2506.18123, 2025.
[150] Srivastava S, Li C, Lingelbach M, et al. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments[C]. Conference on Robot Learning. PMLR, 2022: 477–490.
[151] Krantz J, Wijmans E, Majumdar A, et al. Beyond the nav-graph: Vision-and-language navigation in continuous environments[C]. European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 104–120.
[152] James S, Ma Z, Arrojo D R, et al. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 2020, 5(2): 3019−3026 doi: 10.1109/LRA.2020.2974707
[153] Yu T, Quillen D, He Z, et al. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning[C]. Conference on Robot Learning. PMLR, 2020: 1094–1100.
[154] Gupta A, Kumar V, Lynch C, et al. Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning[C]. Conference on Robot Learning. PMLR, 2020: 1025-1037.
[155] Mees O, Hermann L, Rosete-Beas E, et al. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 2022, 7(3): 7327−7334 doi: 10.1109/LRA.2022.3180108
[156] Liu B, Zhu Y, Gao C, et al. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 2023, 36: 44776−44791
[157] Tassa Y, Doron Y, Muldal A, et al. Deepmind control suite[J]. arXiv preprint arXiv: 1801.00690, 2018.
[158] Zhou X, Han X, Yang F, et al. OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model[J]. Computing Research Repository, 2025.
[159] Sautenkov O, Yaqoot Y, Lykov A, et al. UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation[C]. ACM/IEEE International Conference on Human-Robot Interaction. IEEE, 2025: 1588-1592.
[160] Eslami S, de Melo G. Mitigate the gap: Investigating approaches for improving cross-modal alignment in clip[J]. arXiv preprint arXiv: 2406.17639, 2024.
[161] Song S, Li X, Li S, et al. How to bridge the gap between modalities: A comprehensive survey on multimodal large language model[J]. arXiv preprint arXiv: 2311.07594, 2023.
[162] Chen Y, Ding Z, Wang Z, et al. Asynchronous large language model enhanced planner for autonomous driving[C]. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 22-38.
[163] Qian C, Yu X, Huang Z, et al. SpotVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer[J]. arXiv preprint arXiv: 2508.12638, 2025.
[164] Kwon T, Di Palo N, Johns E. Language models as zero-shot trajectory generators. IEEE Robotics and Automation Letters, 2024, 9(7): 6728−6735 doi: 10.1109/LRA.2024.3410155
[165] Wei H, Zhang Z, He S, et al. PlanGenLLMs: A Modern Survey of LLM Planning Capabilities[C]. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), 2025.
[166] Bruce J, Dennis M D, Edwards A, et al. Genie: Generative interactive environments[C]. Forty-first International Conference on Machine Learning. 2024.
[167] Kang B, Yue Y, Lu R, et al. How far is video generation from world model: A physical law perspective[J]. International Conference on Machine Learning, 2025.