
面向具身操作的视觉−语言−动作模型综述

李浩然 陈宇辉 崔文博 刘卫恒 刘锴 周明才 张正涛 赵冬斌

引用本文: 李浩然, 陈宇辉, 崔文博, 刘卫恒, 刘锴, 周明才, 张正涛, 赵冬斌. 面向具身操作的视觉−语言−动作模型综述. 自动化学报, 2026, 52(1): 1−34 doi: 10.16383/j.aas.c250394
Citation: Li Hao-Ran, Chen Yu-Hui, Cui Wen-Bo, Liu Wei-Heng, Liu Kai, Zhou Ming-Cai, Zhang Zheng-Tao, Zhao Dong-Bin. Survey of vision-language-action models for embodied manipulation. Acta Automatica Sinica, 2026, 52(1): 1−34 doi: 10.16383/j.aas.c250394


doi: 10.16383/j.aas.c250394 cstr: 32138.14.j.aas.c250394
基金项目: 国家自然科学基金(62136008, 62173324)资助
    作者简介:

    李浩然:中国科学院自动化研究所副研究员. 2015年获得中南大学学士学位, 2020年获得中国科学院自动化研究所控制理论与控制工程专业博士学位. 主要研究方向为具身智能, 强化学习和机器人学习. E-mail: lihaoran2015@ia.ac.cn

    陈宇辉:中国科学院自动化研究所博士研究生. 2022年获得北京理工大学学士学位和澳大利亚国立大学学士学位. 主要研究方向为具身智能、强化学习和机器人学习. E-mail: chenyuhui2022@ia.ac.cn

    崔文博:中国科学院自动化研究所博士研究生. 2020年获得东北农业大学学士学位, 2023年获得大连理工大学硕士学位. 主要研究方向为三维计算机视觉, 具身智能. E-mail: cuiwenbo2023@ia.ac.cn

    刘卫恒:中国科学院自动化研究所博士研究生. 2024年获得北京航空航天大学学士学位. 主要研究方向为具身智能, 强化学习和3D视觉. E-mail: weihliu2002@gmail.com

    刘锴:中国科学院自动化研究所博士研究生. 2025年获得西安交通大学学士学位. 主要研究方向为具身智能和世界模型. E-mail: liukai2025@ia.ac.cn

    周明才:中国科学院自动化研究所副研究员. 2010年获得中国科学院自动化研究所博士学位. 主要研究方向为具身智能, 计算机视觉, 机器学习, 计算机图形学, 增强现实. E-mail: mingcai.zhou@ia.ac.cn

    张正涛:中国科学院自动化研究所工业视觉与智能装备技术工程实验室教授. 2004年获得中国石油大学学士学位, 2007年获得北京理工大学硕士学位, 2010年获得中国科学院自动化研究所控制科学与工程专业博士学位. 主要研究方向为工业视觉检测, 智能机器人. E-mail: zhengtao.zhang@ia.ac.cn

    赵冬斌:中国科学院自动化研究所研究员. 他分别于1994年、1996年和2000年获得哈尔滨工业大学学士学位、硕士学位和博士学位. 主要研究方向为强化学习, 具身智能, 智能驾驶, 智能博弈. 本文通信作者.  E-mail: dongbin.zhao@ia.ac.cn

Survey of Vision-language-action Models for Embodied Manipulation

Funds: Supported by National Natural Science Foundation of China (62136008, 62173324)
    Author Bio:

    LI Hao-Ran Associate researcher at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor degree from Central South University in 2015, and his Ph.D. degree in control theory and control engineering from Institute of Automation, Chinese Academy of Sciences in 2020. His research interests include embodied intelligence, reinforcement learning, and robotic learning

    CHEN Yu-Hui Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor degree from Beijing Institute of Technology and Australian National University in 2022. His research interests include embodied intelligence, reinforcement learning and robotic learning

    CUI Wen-Bo Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor degree from Northeast Agricultural University in 2020, and his master degree from Dalian University of Technology in 2023. His research interests include 3D computer vision and embodied intelligence

    LIU Wei-Heng Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor degree from Beihang University in 2024. His research interests include embodied intelligence, reinforcement learning, and 3D vision

    LIU Kai Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor degree from Xi'an Jiaotong University in 2025. His research interests include embodied intelligence and world models

    ZHOU Ming-Cai Associate researcher at the Institute of Automation, Chinese Academy of Sciences. He received his Ph.D. from the Institute of Automation, Chinese Academy of Sciences in 2010. His research interests include embodied intelligence, computer vision, machine learning, computer graphics, augmented reality

    ZHANG Zheng-Tao Professor at the Engineering Laboratory for Industrial Vision and Intelligent Equipment Technology, Institute of Automation, Chinese Academy of Sciences. He received his bachelor degree from China University of Petroleum in 2004, his master degree from Beijing Institute of Technology in 2007, and his Ph.D. degree in control science and engineering from the Institute of Automation, Chinese Academy of Sciences in 2010. His research interests include industrial vision inspection, and intelligent robotics

    ZHAO Dong-Bin Researcher at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor, master, Ph.D. degrees from Harbin Institute of Technology, in 1994, 1996, and 2000, respectively. His research interests include reinforcement learning, embodied intelligence, intelligent driving, and intelligent game. Corresponding author of this paper

  • 摘要: 具身智能系统通过智能体与环境不断交互, 从而提升智能体能力, 受到学术界和产业界的广泛关注. 视觉−语言−动作模型作为一种受到大模型发展启发的机器人通用控制模型, 提高了具身智能系统中智能体与环境交互的能力, 大大扩展了具身智能机器人的应用场景. 本文对具身操作中的视觉−语言−动作模型进行综述. 首先, 详细介绍视觉−语言−动作模型的发展历程. 然后, 对视觉−语言−动作模型架构、训练数据、预训练方法、后训练方法和模型评估5个方面的研究现状进行详细分析. 最后, 针对视觉−语言−动作模型发展过程和落地应用中面临的挑战和未来可能的发展方向进行总结.
  • 图  1  具身操作

    Fig.  1  Embodied manipulation.

    图  2  VLA模型时间线

    Fig.  2  The timeline of VLA models

    图  3  VLA模型架构

    Fig.  3  The framework of VLA models

    图  4  观测编码

    Fig.  4  Observation encoder

    图  5  特征推理

    Fig.  5  Feature reasoning

    图  6  动作解码

    Fig.  6  Action decoder

    图  7  分层系统

    Fig.  7  Hierarchical system

    图  8  数据金字塔和VLA预训练方法(彩色箭头表示不同的训练方法所使用的数据类型. 在每个训练方法中, 红色$ \rightarrow $表示使用数据的顺序, 红色$ + $表示不同的数据一起使用实现联合训练)

    Fig.  8  Data pyramid and VLA pre-training methods (The colored arrows represent the data types used by different training methods. In each training method, the red $ \rightarrow $ indicates the order in which the data is used, and the red $ + $ indicates that different data are used together to achieve joint training)

    表  1  与其他VLA相关综述的对比

    Table  1  Comparison with other VLA-related surveys

    综述 发展历程 模型结构 数据集 预训练方法 后训练方法 模型评估
    [1] ×
    [2] × × ×
    [3] × × ×
    [4] × × × ×
    [5] × ×
    本文

    表  2  数据集与相关方法汇总

    Table  2  Summary of datasets and related methods

    分类 名称 描述 规模 支持任务 相关方法
    互联网图文数据 CapsFusion[102] 大规模图像−文本对数据集, 为多模态预训练设计, 旨在解决现有图像−文本数据集的噪声问题和低质量标注问题 1.2亿图像−文本对 图像描述生成, 多模态预训练 $ \pi_0 $[42]
    COCO[101] 大规模图像数据集, 包含80个物体类别和91种材料类别, 每张图片5个语句描述, 且有25万个带关键点标注的行人 33万图片 目标检测, 实例分割、关键点检测, 图像描述生成 $ \pi_0 $[42], ChatVLA[93], ChatVLA-2[94]
    GQA[105] 大规模视觉问答数据集, 专注于真实世界的视觉推理和组合性问答 2260万问题, 11.3万图像 组合性推理, 视觉问答 ChatVLA[93], ChatVLA-2[94]
    LAION-400M[121] 大规模图像−文本对数据集, 包含图像URL、图像和图像描述的嵌入、图像与描述之间的相似性评分以及元数据 4亿图像−文本对 图文检索, 图文生成, 多模态预训练 UniPi[45]
    PixMo[122] 大规模图像−文本对数据集, 图像涵盖70多个主题, 每张图像描述由3位标注者通过语音生成 71.2万张图像, 130万描述 图像描述生成, 多模态预训练 $ \pi_0 $[42]
    TextVQA[104] 大规模视觉问答数据集, 要求模型理解图像中的文本内容来回答问题 2.8万图片, 4.53万问题, 45.3万回答 文本推理, 视觉问答 ChatVLA[93], ChatVLA-2[94]
    VQAv2[103] 大规模开放式问答数据集, 由人工标注, 面向开放世界视觉问答任务 26.5万图片, 44.3万问题, 443万回答 常识推理, 视觉问答 $ \pi_0 $[42]
    WebLI[123] 超大规模多语言图像−文本对数据集, 涵盖36种语言和多样化文化背景, 包含13亿图像−文本对, 旨在提升视觉语言模型在全球范围内的泛化能力与文化适应性 100亿图像−文本对 光学字符识别, 图文检索, 图像描述生成, 视觉问答, 多模态预训练 RT-2[10]
    视频数据 Ego-4D[108] 大规模第一人称视角视频数据集, 涵盖数百种场景, 由来自全球74个地点和9个不同国家的931名参与者拍摄 3670小时视频 视频理解, 多模态感知 GO-1[63], GR-1[47], GR00T N1[57], Magma[124], UniVLA[64]
    Ego-Exo-4D[109] 由Ego-4D数据集扩展的大规模第一/第三人称视角的多模态视频数据集, 增加多视角同步捕捉, 专注于技能活动研究 1286小时视频 跨视角表征学习, 技能理解, 多模态感知 GR00T N1[57]
    EPIC-KITCHENS-100[107] 大规模第一人称视角视频数据集, 包含45个厨房环境下的动作识别, 捕捉了多种家庭活动, 包括9万个动作 100小时视频, 2000万帧 动作识别, 环境理解, 多模态推理, 场景泛化 ARM4R[69], CoT-VLA[65], GR-2[48], GR00T N1[57], Magma[124], HPT[33]
    HowTo100M 大规模叙述视频数据集, 主要是教学视频, 其中内容创建者教授复杂的任务, 并明确解释屏幕上的视觉内容 1.36亿视频片段 图像描述生成, 多模态预训练 GR-2[48]
    Kinetics-700[125] 大规模视频数据集, 涵盖700种人类动作类别, 包含人与物体及人与人之间的互动 65万个视频 动作识别, 视频理解 GR-2[48]
    Something-Something V2[106] 大规模带标记视频数据集, 包含人类使用日常物品执行的174种基本动作 22万视频片段 动作识别, 自监督学习, 多模态推理 CoT-VLA[65], GR-2[48], LAPA[49], Magma[124], TriVLA[54], VPP[46]
    仿真数据 DexMimicGen[116] 大规模仿真数据集, 涵盖精密操作与灵巧手场景下的多种复杂操作任务, 通过人类演示与仿真生成 2.1万条轨迹 灵巧操作学习, 精细控制, 仿真到现实迁移 GR00T N1[57], GR00T N1.5[57]
    RoboCasa[113] 大规模仿真数据集, 提供120种厨房场景与2500个3D物体, 结合大语言模型生成任务与自动轨迹生成, 支持通用机器人操作与策略学习 超过10万轨迹 策略学习, 环境理解, 多模态预训练, 仿真到现实迁移 GR00T N1[57]
    SynGrasp-1B[114] 大规模合成动作数据集, 专注于机器人抓取技能的学习, 涵盖240个物体类别和1万个物体 10亿帧 抓取策略学习, 仿真到现实迁移, 跨任务泛化 GraspVLA[114]
    真实机器人数据 AgiBot World[63] 大规模多场景数据集, 涵盖家居、餐饮、工业、商超及办公5大核心场景, 覆盖超过100种真实场景和3000多种日常物品, 其中80%的任务为长程任务 100多万轨迹, 2976.4小时交互数据 多任务学习, 跨场景泛化, 多模态预训练, 仿真与真实结合训练 GO-1[63], GR00T N1[57]
    BC-Z[20] 大规模机器人模仿学习数据集, 涵盖100种操作任务, 通过专家远程操作与自主收集, 支持零样本任务泛化和语言与视频条件下的策略学习 2.58万条轨迹 多任务学习, 跨场景泛化, 多模态预训练 $ \pi_0 $[42], CoT-VLA[65], TraceVLA[126], UniPi[45]
    Bridge Data[118] 大规模多任务操作数据集, 涵盖使用WidowX机械臂在10个环境中收集的71个厨房任务 7200条轨迹 多任务学习, 跨场景泛化, 多模态预训练 UniPi[45], GR-2[48]
    Bridge Data V2[119] 大规模多任务操作数据集, 使用WidowX机械臂在24个环境中收集, 涵盖广泛任务与环境变化, 支持图像和语言条件下的多任务学习与技能泛化 6万条轨迹 多任务学习, 跨场景泛化, 多模态预训练 $ \pi_0 $[42], ECoT[127], LAPA[49], NORA[128], RDT[31], TraceVLA[126]
    DROID[159] 大规模真实机器人操作数据集, 覆盖564个多样化场景和86种任务类型, 支持丰富动作和环境组合, 促进机器人通用操作技能学习 7.6万条轨迹, 350小时交互数据 多任务学习, 多机器人协同学习, 跨场景泛化, 多模态预训练 $ \pi_0 $[42], DiVLA[129], DreamVLA[67], HybridVLA[99], NORA[128], RDT[31], SpatialVLA[68], UniAct[130]
    Mobile ALOHA[131] 大规模数据集, 支持双臂移动操作, 涵盖厨房、实验室等多场景下的复合任务学习, 融合人类示教与自动化采集两种采集范式 500轨迹 多任务学习, 导航学习, 多模态预训练 RDT[31]
    FrodoBots-2k 大规模多模态数据集, 遥控操作收集涵盖视频、GPS、IMU、音频与人类控制数据, 覆盖全球10多座城市, 支持移动机器人导航与感知研究 2000小时交互数据 驾驶策略学习, 跨场景泛化, 多模态预训练 HPT[33]
    OXE[32] 大规模多机器人操作数据集, 涵盖22种机器人的527种技能和16万项任务, 提供标准化格式支持, 促进跨形态经验迁移与通用策略学习 超过100万条轨迹 多任务学习, 跨场景泛化, 多模态预训练 $ \pi_0 $[42], CogACT[40], CoT-VLA[65], DiVLA[129], GR00T N1[57], HPT[33], HybridVLA[99], LAPA[49], NORA[128], RDT[31], RoboVLMs[44], SpatialVLA[68], TriVLA[54], UniAct[130], UniVLA[64], VPP[46]
    RDT-1B[31] 大规模机器人操作数据集, 涵盖单臂、双臂与移动机械臂等多种机器人形态 超过100万轨迹 多任务学习, 跨形态泛化, 跨场景泛化, 多模态预训练 RDT[31]
    RH20T[132] 大规模多模态机器人操作数据集, 包含4种主流机械臂、4种夹爪和3种力传感器共7种机器人硬件配置组合, 涵盖147种任务与42种技能 11万序列, 5千万帧 力觉感知融合, 多形态技能泛化, 多模态预训练 RDT[31]
    RoboSet[133] 大规模真实机器人操作数据集, 专注于厨房环境, 包含动觉示教与遥操示教的多视角轨迹及丰富场景变化 2.85万条轨迹 多任务学习, 变化场景适应, 多模态预训练 RDT[31]
    RoboMIND[134] 大规模机器人操作数据集, 涵盖479种任务、96种物体类别, 38种操作技能及多种机械臂与人形机器人, 支持任务执行性能提升与失败案例分析 10.7万成功轨迹, 5000失败轨迹 多任务学习, 失败分析与自适应改进, 多模态预训练 HybridVLA[99]
    RT-1[21] 大规模真实机器人数据集, 包含13台机械臂上采集的带语言指令标注的视频, 涵盖700多种任务, 支持零样本泛化和复杂操作技能学习 13万视频片段 多任务学习, 跨场景泛化, 多模态预训练 Gen2Act[135], GR-2[48], RDT[31], RT-1[21], RT-2[10], TraceVLA[126]
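    表2中的异构数据(图文、视频、仿真与真实机器人轨迹)在用于VLA预训练前, 通常会被转换为统一的"观测−语言指令−动作块"样本以便混合采样. 下面给出一个最小的Python示意(并非任何数据集的官方格式, 其中的类名、字段与函数均为本文为说明而假设的), 展示这类统一样本的组织与采样方式.

```python
# 最小示意: 将异构机器人数据统一为"观测-语言指令-动作块"样本(字段与函数均为示意性假设)
from dataclasses import dataclass, field
from typing import List
import random

import numpy as np


@dataclass
class Step:
    image: np.ndarray    # 相机RGB观测, 形状(H, W, 3)
    proprio: np.ndarray  # 机器人本体状态(关节角或末端位姿等)
    action: np.ndarray   # 该时刻的动作(例如7维末端增量+夹爪开合)


@dataclass
class Episode:
    instruction: str                                  # 语言指令
    steps: List[Step] = field(default_factory=list)   # 按时间排列的步骤


def sample_chunk(ep: Episode, horizon: int = 8):
    """从一条轨迹中随机截取(观测, 指令, 动作块)训练样本."""
    t = random.randrange(max(1, len(ep.steps) - horizon))
    obs = ep.steps[t]
    actions = np.stack([s.action for s in ep.steps[t:t + horizon]])
    return obs.image, obs.proprio, ep.instruction, actions


# 构造一条玩具轨迹并采样一个训练样本
ep = Episode(
    instruction="把勺子放在毛巾上",
    steps=[Step(np.zeros((224, 224, 3), np.uint8), np.zeros(8), np.zeros(7))
           for _ in range(32)],
)
image, proprio, text, action_chunk = sample_chunk(ep)
print(image.shape, proprio.shape, text, action_chunk.shape)  # (224, 224, 3) (8,) ... (8, 7)
```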

    表  3  VLA后训练方法汇总

    Table  3  Summary of VLA post-training methods

    类别 相关工作 主要贡献/优势 缺陷 适用场景 发表 年份
    监督微调 $ \pi_{0} $ [42] 提出基于流匹配的动作解码器, 在高质量真实机器人数据上通过监督微调, 有效提升模型在复杂、长程任务中的执行稳定性与成功率 对专家数据的质量与覆盖度较为敏感, 分布外泛化与跨场景迁移受限, 需要进行域内微调 适用于长时序任务及不便进行在线探索的真实部署环境 RSS 2025
    GO-1[63] 在海量互联网异构视频与真实机器人数据上预训练, 并结合MoE结构, 仅需少量真实数据的监督微调即可快速适应新任务与新场景, 显著降低数据需求 预训练数据噪声与域间分布差异产生错误对齐; MoE带来训练与推理的系统复杂度和算力开销, 缺乏稀有技能与密集接触场景真机数据 数据相对稀缺但任务多样、需快速落地的新场景迁移 IROS 2025
    GR00T N1[57] 在多样化人形机器人感知与控制数据上预训练, 结合快速反射与规划推理的双系统架构, 后训练既具备高反应速度, 又能进行复杂任务规划, 在多种场景中展现出鲁棒的人形机器人控制能力 依赖大规模高质量人形数据与复杂系统集成, 训练与推理的算力/工程成本高; 密集接触工况或强约束环境中需要额外安全策略与调参 需要同时具备灵敏即时反应和高层规划的人形机器人应用, 例如服务场景、多步骤装配、移动操作与协作任务 arXiv 2025
    GR-1[47] 在40万条跨形态机器人数据上进行模仿预训练, 构建首个开放的多任务、多机器人形态统一策略基线, 在少量目标机器人真实数据上通过监督微调进行后训练, 实现“单模型多机器人”控制的可行性, 并显著降低多形态适配成本 依赖大规模异构数据的质量与覆盖度;对极端工况/特定形态的细粒度控制可能仍需额外调参, 且通才策略在个别边缘任务上可能不如专才策略 多形态、多任务的统一部署与快速落地, 以及对维护成本敏感、需快速适配新形态的应用 ICLR 2024
    GR-2[48] 在GR-1框架基础上加入视觉−语言−动作三模态对齐, 并引入更大规模的互联网视频自监督数据进行预训练, 在少量目标机器人真实数据上通过监督微调进行后训练, 进一步提升了复杂指令理解和跨场景任务执行的泛化能力 依赖互联网视频与多模态对齐质量, 可能受噪声与域间偏移影响; 模型规模与训练/对齐流程复杂, 密集接触任务的安全性与精细控制仍需额外工程与人为监督 真机数据稀缺但可获取大量弱标视频的应用; 需快速适配新环境/新任务的指令驱动型操作; 跨机器人形态迁移与多步骤长程任务执行 arXiv 2024
    GR-3[138] 采用跨域数据联合训练, 将预训练数据规模扩展至百万级, 重点强化语言指令理解与零样本跨环境泛化能力, 在少量目标机器人真实数据上通过监督微调进行后训练, 显著提升了在未见任务与新环境中的执行稳定性和成功率 高度依赖大规模异构数据的质量与对齐, 数据清洗与标注成本高; 训练与部署的系统/算力开销较大, 密集接触与高精度控制仍可能需要额外专用微调策略 数据分布多样、需频繁迁移的新环境任务; 指令驱动的长程多步骤操作 arXiv 2025
    Helix[56] 首个人形VLA, 在Figure人形的大规模感知与控制数据上进行预训练, 可对整个人形上半身输出高频连续控制, 在少量目标任务的真实机器人数据上通过监督微调进行后训练, 实现精确且稳定的上肢协调控制 训练数据仅限于当前机器人形态, 迁移到异构人形可能需额外标定与微调;高频连续控制带来训练与推理算力/实时性压力, 密集接触场景仍需安全机制与细致调参 需要精细上肢操作与稳定协同的人形应用, 例如装配、工具使用、开关/旋钮操作与服务场景 arXiv 2025
    HPT[33] 分层提示微调, 将LLM生成的文字描述拆分为“层次化结构”与“语义文本”两路提示并同步学习, 在多任务机器人数据上预训练后, 用少量目标任务的真实数据监督微调后训练, 保持语言理解能力同时显著提升任务执行的稳定性与泛化性 需要高质量的层次化指令生成与标注, 分层提示设计与超参较多、工程复杂度偏高; 对密集接触/高频闭环控制仍可能需额外控制器或安全机制配合 语义复杂、步骤明确的多步骤操作, 例如装配、烹饪式流程、服务机器人任务 NeurIPS 2024
    Magma[124] 微软提出的多模态基础模型, 可同时感知视觉与语言并输出动作, 在少量目标机器人真实数据上通过监督微调进行后训练, 实现了从数字智能体到实体机器人的高效迁移, 在真实环境任务中表现出稳定的感知−行动能力 预训练与对齐流程复杂, 对多模态同步标注与时间对齐敏感; 推理开销与系统集成成本较高, 密集接触或高精度操作仍需专才策略与安全约束 需要从仿真/视频智能体快速落地到真实机器人、任务多样且数据相对有限的应用, 例如服务机器人、仓储与装配等真实场景的多任务部署 CVPR 2025
    Octo[30] 首个完全开源的通用Transformer Diffusion架构, 在少量目标机器人真实数据上通过监督微调进行后训练, 实现跨机器人形态的快速适配与任务迁移, 在多样任务中保持高执行性能 扩散式动作生成推理开销较大, 对密集接触/高精度操作仍可能需要专用微调方法与安全机制; 性能对专家数据分布与对齐质量较为敏感 跨形态迁移与多任务统一基线搭建、研究与工业落地的可复现方案, 以及以少量目标数据完成快速适配的真实部署 CoRL 2023
    OpenVLA[39] 70亿参数开源VLA, 在少量目标机器人真实数据上通过监督微调后训练, 内置多机器人形态的适配能力, 从而实现高效迁移, 显著降低跨形态适配成本并且可以保持任务的执行性能 模型规模与推理开销较大, 对实时性与边缘设备部署有压力; 在密集接触/高精度任务上仍依赖高质量对齐与额外标定和安全机制 作为跨形态统一基线与研究/工业的可复现方案, 在数据有限的场景进行快速适配与任务迁移, 可在多机器人形态间共享策略 CoRL 2024
    RDT[31] 首个双臂操作扩散基础模型, 在稀缺数据场景下能够生成多模态动作分布, 在少量目标任务的真实双臂机器人数据上通过监督微调进行后训练, 显著提升在复杂协作操作任务中的稳定性与成功率 扩散生成带来推理时延与算力开销, 对实时高频控制有压力; 对精细力控与密集接触场景仍需额外传感/控制层或专用微调方法, 且对演示对齐质量较敏感 双臂协作的装配、整理、搬运与工具协同等任务, 数据有限但需高稳定性的工业/服务场景 ICLR 2025
    监督微调 RoboFlamingo[37] 以OpenFlamingo作为视觉−语言底座, 在少量目标机器人真实演示数据上通过监督微调进行后训练, 在多任务指令条件下显著提升执行成功率与泛化能力 性能对指令−动作对齐质量与专家轨迹丰富度敏感; 在密集接触、强时延约束或高频闭环控制任务上仍受限, 需要额外控制器与安全机制 指令驱动的多任务桌面操作与服务场景、数据有限但可快速收集少量演示的部署 ICLR 2024
    RT-2[10] 使用互联网数据和机器人轨迹数据预训练实现端到端机器人控制, 在少量目标机器人的真实演示数据上通过动作映射进行后训练, 赋予模型“语义推理”能力, 并在未见场景下显著提升任务成功率 依赖大规模跨域数据的对齐与清洗; 对密集接触/高精度控制需额外控制器与安全机制支持, 推理与部署成本较高; 操作泛化能力比较差 指令驱动、语义复杂且环境多变的服务/家居/仓储任务 CoRL 2023
    UniVLA[64] 将视觉、语言与动作离散化为统一令牌序列, 并用单一自回归Transformer进行统一建模, 在结合世界模型预训练后, 在少量目标机器人真实数据上通过监督微调进行后训练, 显著提升了长时序任务的迁移能力与执行稳定性 离散化与自回归解码在高频控制下存在时延与信息损失风险; 世界模型预训练和对齐流程复杂, 对数据质量与时间对齐敏感 需要长程规划与多步骤执行的指令驱动任务、跨形态/跨场景迁移 RSS 2025
    VPP[46] 将视频扩散模型的未来表征嵌入策略网络, 以隐式方式学习逆动力学, 在少量目标任务的真实机器人数据上通过监督微调进行后训练, 显著提升长时预测下的控制稳定性与样本效率 依赖视频扩散模型的质量与时序对齐, 训练/推理开销较高; 在密集接触与精细力控任务中可能存在动力学失配, 需额外控制器或微调 长程、多步骤、需要前瞻规划的操作, 例如装配、整理、导航取放 ICML 2025
    强化微调 ConRFT[145] 在预训练VLA的基础上, 采用一致性策略并结合人为干预, 通过在线强化学习在真实机器人上进行强化微调后训练; 仅需45$ \sim $90 min即可将任务成功率提升至96% 需要在线交互与人类干预, 工程与系统集成复杂度较高; 对一致性目标/超参较敏感, 迁移到密集接触或新设备时仍需额外调参与校准 真实机器人上的快速任务适配与性能冲刺, 尤其是密集接触、风险较高且需稳定性的工业/服务场景 RSS 2025
    GRAPE[146] 利用VLM将复杂任务分解为子目标并生成轨迹级偏好奖励, 在真实机器人交互数据上通过直接偏好优化(DPO)进行后训练, 无需额外人工标注即可提升任务成功率和与人类偏好的一致性 偏好可信度受VLM评估与分解粒度影响, 可能引入噪声或阶段性误导; 对长程依赖与接触密集场景仍需精心设计阶段/约束, 训练稳定性对数据分布较敏感 难以手工设计奖励、但能获取交互数据的真实部署; 需要对齐人类偏好(安全、舒适、效率等)的服务/家居/协作操作以及多目标权衡的长程任务 ICRA 2024
    iRe-VLA[147] 提出迭代式RL-SFT环(内环强化微调, 外环监督微调), 在真实与仿真任务数据上仅更新轻量动作解码器进行后训练, 实现高样本效率的稳定收敛, 并在多任务中保持良好的泛化性能 依赖在线交互与环路调度, 例如PPO超参、数据混合比例, 对安全与复位机制有要求; 骨干冻结限制了感知侧的进一步提升 真机/仿真均可交互、真实数据有限但需快速适配与稳健收敛的多任务部署 ICRA 2025
    PARL[148] 在预训练VLA的基础上, 通过Q函数迭代优化动作, 并以模仿学习方式学习这些优化后的动作, 在真实机器人数据上进行后训练, 稳定提升模型的任务执行性能 质量高度依赖Q函数的准确性与覆盖度; 若Q学习出现过估计, 将把错误信号蒸馏进策略; 在线阶段仍需一定交互与工程调参(候选采样、优化步数等) 具备一定离线数据或Q容易训练的真实/仿真操控任务; 希望在不改动大模型骨干的前提下, 以低风险方式持续提升任务执行性能的工业与服务场景 ICLR 2024
    Policy Decorator[149] 将大型离线模仿策略作为基础策略, 在真实机器人数据上在线叠加可学习残差控制器, 结合受控探索与信任域优化进行强化微调后训练, 实现对下层策略模型不可知、稳定且高效的性能提升 性能上限仍受基础策略能力与误差耦合制约; 残差与主策略的协同需要细致的权重/约束设计, 可能引入额外超参调试成本 已有成熟模仿策略部署、需在真实环境中快速提效且不希望改动主干的部署 ICLR 2025
    ReinboT[150] 将强化学习的累计回报目标显式融入VLA损失函数, 在真实机器人数据上通过稳定的训练流程进行训练, 提升任务执行性能并保持训练收敛的稳定性 价值学习容易受到奖励设计/标注与时序的影响; 引入回报条件与价值分支增加系统与算力开销 具备一定真实交互或离线回放数据、希望在不大改骨干的前提下系统性提效的多任务部署 ICML 2025
    RIPT-VLA[151] 在1-demo监督微调起点上, 通过交互式强化微调并结合RLOO确保梯度稳定, 在仿真环境中完成全部实验, 成功率可提升至97%, 验证了在极少示例条件下的高效任务学习能力 主要在仿真中验证, 真实部署的感知噪声与复位/安全成本未充分评估; 对二元成功信号与采样分组策略敏感, 探索策略易陷入局部最优 数据极度稀缺但可进行大量模拟交互的场景; 具有明确成败判据的短/中程操控任务 CVPR 2025
    RLDG[152] 在真实机器人上通过强化学习生成高质量的自监督轨迹, 并将这些“内生数据”蒸馏回大模型, 无需额外人类示范即可迭代提升模型性能, 实现真机环境下的高效自我改进 依赖稳健的在线RL基础设施与安全复位机制, 训练易受奖励设计/不稳定性影响; 真机交互与算力成本仍不低, 且策略采样数据可能带来偏置与遗忘风险, 需要精心的数据筛选与蒸馏配方 具备可自动判定成败/奖励的操作任务与可用机器人集群的场景; 希望以极少人类示范实现长期、持续迭代学习的真实部署 RSS 2025
    TGRPO[153] 在GRPO基础上进行拓展, 同时利用时间步级与轨迹级奖励对模型进行微调, 全部实验均在仿真环境中完成, 显著提升连续控制任务中的动作质量与长时序一致性 目前主要在仿真环境中验证, 真实环境中的感知噪声、延迟与安全约束尚未系统评估; 对组内采样与优势归一化等超参较敏感, 奖励/成功判据设计不当可能削弱收益 需要长程依赖与连续控制的多步骤操控, 例如装配、抓取−放置序列 arXiv 2025
    强化微调 VLA-RL[154] 在预训练VLA的基础上, 结合PPO与RPRM[154] 进行微调, 并利用并行仿真环境加速训练, 全部实验均在仿真中完成, 实现了更快的收敛速度与更高的任务执行性能 基于仿真验证, 真实环境的传感噪声、延迟与安全约束尚未充分评估; 超参与奖励设计敏感, 仿真到现实存在潜在性能落差 需要大规模离线/并行仿真进行策略筛选与组合搜索的多任务操控 arXiv 2025
    推理扩展 FOREWARN[155] 利用VLM的轨迹评估能力, 对VLA生成的多个候选动作规划进行评估筛选, 在真实机器人数据上完成全部训练与验证, 避免奖励函数设计和值函数学习的需求, 并提升任务执行的稳定性与成功率 依赖VLM评估的可靠性与无偏性, 易受分布外场景与部分可观测性的影响; 多候选采样与评估带来计算与时延开销 难以手工设计奖励的真实部署任务、需要快速稳健提效而不改动训练流程的场景, 以及评估器判断可靠性高下的工业与服务机器人应用 RSS 2025
    Hume[60] 训练分层控制系统的S2系统通过预测值函数来评估生成动作序列的质量, 在仿真与真实机器人数据上进行训练与验证, 提升动作决策的可靠性与任务执行性能 分层架构与价值评估带来系统与推理复杂度、实时性开销; 价值学习对数据与超参敏感 需要深度推理与高频控制并存的复杂长程任务, 例如装配、工具使用与多步骤服务操作 arXiv 2025
    ITPS[156] 在动作生成过程中允许人类通过交互方式输入轨迹偏好, 并通过引导扩散过程生成期望的动作序列, 在仿真与真实机器人数据上进行训练和验证, 提升模型对人类意图的响应能力与任务执行的可控性 需在线人机交互与额外推理开销; 偏好可行度低或引导强度不当可能导致过度约束或目标漂移, 对实时性与稳定性提出更高要求 需要按用户偏好动态定制行为的服务/协作任务、对安全与舒适有要求的真实部署 ICRA 2025
    RoboMonkey[157] 提出一种“采样−验证”的推理期扩展框架, 并验证了动作误差与生成样本数量之间近似符合幂律关系, 在仿真与真实机器人数据上进行评估, 有效降低推理过程中的动作错误率并提升任务成功率 依赖高质量验证器与评估信号, 分布外场景可能失效; 多候选采样与验证增加推理时延与算力开销, 在实时高频控制与密集接触任务中需谨慎权衡 允许较大推理预算以换取稳健性的部署, 例如离线规划、低速高精度操作、关键任务执行前的安全校验 RSS 2025
    V-GPS[158] 针对同一指令在VLA上并行采样多条候选动作轨迹, 并利用实时视觉估计进行评分, 选择并平滑最优轨迹后下发控制, 在仿真与真实机器人数据上进行验证, 有效提升了安全性与成功率 依赖在线视觉感知与评分器可靠性, 多候选采样与评估增加推理时延与算力开销; 在密集接触操作中, 视觉滞后与估计误差导致评分偏差 对安全与稳健性要求高、允许一定推理预算的真实部署, 例如抓取与装配、拥挤/狭窄环境操作 CoRL 2024
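    表3中"推理扩展"一类方法(如FOREWARN[155]、RoboMonkey[157]与V-GPS[158])在推理阶段共享"采样−验证"的基本模式: 对同一观测与指令并行采样多条候选动作序列, 由评分器打分后执行得分最高者. 下面给出一个最小的Python示意, 其中policy与verifier均为本文假设的占位函数, 并非上述任一论文的官方接口.

```python
# "采样-验证"式推理扩展的最小示意(policy、verifier均为示意性占位)
from typing import Callable, List
import numpy as np


def sample_then_verify(
    policy: Callable[[np.ndarray, str], np.ndarray],            # VLA策略: (观测, 指令) -> 动作序列
    verifier: Callable[[np.ndarray, str, np.ndarray], float],   # 评分器: 分数越高越好
    obs: np.ndarray,
    instruction: str,
    num_samples: int = 16,
) -> np.ndarray:
    """采样num_samples条候选动作序列, 由验证器打分后返回得分最高者."""
    candidates: List[np.ndarray] = [policy(obs, instruction) for _ in range(num_samples)]
    scores = [verifier(obs, instruction, a) for a in candidates]
    return candidates[int(np.argmax(scores))]


# 玩具示例: 随机策略 + "动作幅度越小越好"的玩具评分器, 仅用于演示流程
rng = np.random.default_rng(0)
toy_policy = lambda obs, text: rng.normal(size=(8, 7))            # 8步、7维动作块
toy_verifier = lambda obs, text, act: -float(np.abs(act).sum())   # 仅作演示
best = sample_then_verify(toy_policy, toy_verifier, np.zeros((224, 224, 3)), "拿可乐罐")
print(best.shape)  # (8, 7)
```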

    表  4  不同VLA模型在真实环境中的测试结果[168]

    Table  4  Evaluation results of different VLA models in real-world environments[168]

    排名 VLA模型 Score SD A/B Evals
    1 $ \pi_{0.5} $-DROID 1883 26.1 339
    2 PaliGemma-FAST-specialist-DROID 1851 25.8 741
    3 $ \pi_{0} $-FAST-DROID 1814 24.8 505
    4 PaliGemma-VQ-DROID 1765 33.3 526
    5 PaliGemma-FAST-DROID 1759 35.5 723
    6 PaliGemma-Diffusion-DROID 1585 56.4 514
    7 DAM 1213 210.5 53
    8 $ \pi_{0} $-DROID 894 28.5 781
    9 PaliGemma-Binning-DROID 734 26.2 404

    表  5  仿真器与VLA模型评估

    Table  5  Benchmarks and Simulators for VLAs

    仿真环境 仿真引擎 输入 机器人 任务 相关方法
    CALVIN[172] PyBullet RGB/D, 语言指令 Franka Emika Panda 长序列、语言指令驱动的桌面操作任务 DreamVLA[67], GR-1[47], GR-2[48], RoboFlamingo[37], RoboVLMs[44], TriVLA[54], UniVLA[64], UP-VLA[?], VPP[46]
    Franka-Kitchen[173] MuJoCo RGB/D Franka Emika Panda 厨房多物体交互, 多目标组合任务 HiRT[53]
    SimplerEnv[174] SAPIEN RGB/D, 语言指令 WidowX, Google Robot 多样化、语言指令驱动的桌面操作任务 CogACT[40], HPT[33], Hume[60], LAPA[49], Octo[30], OpenVLA[39], RoboVLMs[44], RT-1[21], SpatialVLA[68], TraceVLA[126], UniVLA[64]
    LIBERO[175] MuJoCo RGB/D, 语言指令 Franka Emika Panda 专注于终身学习的程序化生成任务 BitVLA[71], CoT-VLA[65], Fast ECoT[127], FLIP[176], Hume[60], NORA[128], OpenVLA[39], OpenVLA-OFT[76], SmolVLA[70], SpatialVLA[68], SP-VLA[177], TriVLA[54], UniAct[130], UniVLA[64], WorldVLA[66]
    Meta-World[178] MuJoCo RGB/D Sawyer 50种用于元学习/多任务学习的桌面操作任务 HiRT[53], HPT[33], TinyVLA[179], VPP[46]
    RLBench[180] CoppeliaSim RGB/D, 语言指令 Franka Emika Panda 100种大规模、带语言标注的多样化操作任务 HybridVLA[99]
    RoboMimic[181] MuJoCo RGB/D, 语言指令 Franka Emika Panda 基于人类演示的模仿学习任务集 HPT[33]
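    表4至表7中报告的成功率通常由固定回合数的评测循环统计得到: 每个回合重置环境、执行策略直至任务成功或超时, 最后取成功回合的比例. 下面是一个与Gym风格接口松散对应的最小Python示意, 其中的ToyEnv与策略均为本文假设的演示用占位实现, 并非CALVIN、LIBERO或SimplerEnv的官方接口.

```python
# 成功率评测循环的最小示意(环境与策略均为示意性占位)
import numpy as np


def evaluate_success_rate(env, policy, num_episodes: int = 50, max_steps: int = 300) -> float:
    """返回num_episodes回合内的任务成功率(0~1)."""
    successes = 0
    for _ in range(num_episodes):
        obs, instruction = env.reset()
        for _ in range(max_steps):
            action = policy(obs, instruction)
            obs, done, success = env.step(action)
            if done:
                successes += int(success)
                break
    return successes / num_episodes


class ToyEnv:
    """玩具环境: 以固定概率在回合结束时判定成功, 仅用于演示评测流程."""
    def __init__(self, p_success: float = 0.8, horizon: int = 10):
        self.p, self.h, self.t = p_success, horizon, 0

    def reset(self):
        self.t = 0
        return np.zeros((224, 224, 3)), "把物体放进抽屉"

    def step(self, action):
        self.t += 1
        done = self.t >= self.h
        success = done and (np.random.rand() < self.p)
        return np.zeros((224, 224, 3)), done, success


rate = evaluate_success_rate(ToyEnv(), lambda obs, text: np.zeros(7), num_episodes=20)
print(f"成功率: {rate:.0%}")
```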

    表  6  不同VLA模型在LIBERO[175]中的测试结果

    Table  6  Evaluation results of different VLA models in LIBERO[175]

    VLA模型 Spatial Object Goal Long Average
    $ \pi_0 $[42] 97% 99% 96% 85% 94%
    $ \pi_0 $-FAST[42] 96% 97% 89% 60% 86%
    BitVLA[71] 97% 99% 94% 88% 94%
    CoT-VLA[65] 88% 92% 88% 69% 84%
    Fast ECoT[127] 83% 85% 83% 69% 80%
    GR00T N1[57] 94% 98% 93% 91% 94%
    Hume[60] 97% 99% 99% 97% 98%
    NORA[128] 86% 88% 77% 45% 74%
    OpenVLA[39] 85% 88% 79% 54% 77%
    OpenVLA-OFT[76] 96% 98% 96% 91% 95%
    SmolVLA[70] 93% 94% 91% 77% 89%
    SpatialVLA[68] 88% 90% 79% 56% 78%
    SP-VLA[177] 75% 86% 84% 54% 75%
    TriVLA[54] 91% 94% 90% 73% 87%
    UniAct[130] 77% 87% 77% 70% 78%
    UniVLA[64] 97% 97% 96% 92% 95%
    WorldVLA[66] 73% 88% 80% 27% 67%

    表  7  不同VLA模型在SimplerEnv[174]中的测试结果

    Table  7  Evaluation results of different VLA models in SimplerEnv[174]

    VLA模型 Google Robot WidowX Robot
    拿可乐罐 移动物体 开/关抽屉 把物体放进抽屉 把胡萝卜放到盘子里 把勺子放在毛巾上 叠方块 把鸡蛋放到篮子里
    $ \pi_0 $[42] 73% 65% 38% 0% 29% 17% 63%
    $ \pi_0 $-FAST[42] 75% 68% 43% 62% 22% 29% 83% 48%
    CogACT[40] 91% 85% 72% 51% 51% 72% 15% 68%
    HPT[33] 60% 24% 56%
    Hume[60] 97% 80% 59% 67% 58% 46% 73%
    LAPA[49] 46% 71% 54% 58%
    Octo-Small[30] 10% 47% 4% 57%
    Octo-Base[30] 17% 4% 23% 8% 13% 0% 43%
    OpenVLA[39] 16% 46% 36% 0% 0% 0% 4%
    RoboVLMs[44] 77% 62% 43% 24% 21% 46% 4% 79%
    RT-1[21] 3% 5% 14% 4% 0% 0% 0%
    SpatialVLA[68] 86% 78% 57% 75% 25% 17% 29% 43%
    TraceVLA[126] 44% 55% 44%
    UniVLA[64] 56% 53% 3% 81%

    表  8  国内外产业界发布的VLA模型

    Table  8  Industrial VLA models released domestically and internationally

    公司 代表模型 应用场景
    国外 Google RT-2[10], Gemini Robotics[59] 通用桌面操作场景
    NVIDIA GR00T N1[57], GR00T N1.5 人形移动操作场景
    Figure AI Helix[56] 家居场景、工业场景、物流场景
    Physical Intelligence $ \pi_0 $[42], $ \pi_{0.5} $[77] 家居场景
    国内 字节跳动 RoboFlamingo[37], GR-1[47], GR-2[48], GR-3[138] 通用桌面操作场景
    阿里巴巴 RynnVLA-001[188] 桌面操作
    美的 DexVLA[189], ChatVLA[93], ChatVLA-2[94] 通用桌面操作场景
    银河通用 GraspVLA[114] 桌面抓放场景
    智元 GO-1[63] 桌面操作场景
    星海图 G0[190] 移动操作场景
    星尘智能 DuoCore-WB[191], ControlVLA[192] 移动操作场景、家居场景
    自变量 WALL-OSS[193] 桌面操作场景
    灵初智能 DexGraspVLA[55] 桌面操作场景
  • [1] MA Y E, SONG Z X, ZHUANG Y Z, HAO J Y, KING I. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv: 2405.14093, 2024.
    [2] SAPKOTA R, CAO Y, ROUMELIOTIS K I, KARKEE M. Vision-language-action models: concepts, progress, applications and challenges. arXiv preprint arXiv: 2505.04769, 2025.
    [3] ZHONG Y F, BAI F S, CAI S F, HUANG X C, CHEN Z, ZHANG X W, et al. A survey on vision-language-action models: an action tokenization perspective. arXiv preprint arXiv: 2507.01925, 2025.
    [4] XIANG T Y, JIN A Q, ZHOU X H, GUI M J, XIE X L, LIU S Q, et al. Parallels between VLA model post-training and human motor learning: progress, challenges, and trends. arXiv preprint arXiv: 2506.20966, 2025.
    [5] DIN M U, AKRAM W, SAOUD L S, ROSELL J, HUSSAIN I. Vision language action models in robotic manipulation: a systematic review. arXiv preprint arXiv: 2507.10672, 2025.
    [6] CHEN Y R, CUI W B, CHEN Y W, TAN M, ZHANG X Y, LIU J R, et al. RoboGPT: a LLM-based long-term decision-making embodied agent for instruction following tasks. IEEE Transactions on Cognitive and Developmental Systems, 2025, 17(5): 1163−1174 doi: 10.1109/TCDS.2025.3543364
    [7] JIN Y X, LI D Z, YONG A, SHI J, HAO P, SUN F C, et al. RobotGPT: robot manipulation learning from ChatGPT. IEEE Robotics and Automation Letters, 2024, 9(3): 2543−2550 doi: 10.1109/LRA.2024.3357432
    [8] 白辰佳, 徐华哲, 李学龙. 大模型驱动的具身智能: 研究与挑战. 中国科学: 信息科学, 2024, 54: 2035−2082

    BAI Chen-Jia, XU Hua-Zhe, LI Xue-Long. Embodied-AI with large models: research and challenges. Science China Information Sciences, 2024, 54: 2035−2082
    [9] 王文晟, 谭宁, 黄凯, 张雨浓, 郑伟诗, 孙富春. 基于大模型的具身智能系统综述. 自动化学报, 2025, 51(1): 1−19 doi: 10.16383/j.aas.c240542

    WANG Wen-Sheng, TAN Ning, HUANG Kai, ZHANG Yu-Nong, ZHENG Wei-Shi, SUN Fu-Chun. Embodied intelligence systems based on large models: a survey. Acta Automatica Sinica, 2025, 51(1): 1−19 doi: 10.16383/j.aas.c240542
    [10] ZITKOVICH B, YU T H, XU S C, XU P, XIAO T, XIA F, et al. RT-2: vision-language-action models transfer web knowledge to robotic control. In: Proceedings of Conference on Robot Learning. Atlanta, USA: PMLR, 2023. 2165–2183.
    [11] ZHEN H Y, QIU X W, CHEN P H, YANG J C, YAN X, DU Y L, et al. 3D-VLA: a 3D vision-language-action generative world model. In: Proceedings of International Conference on Machine Learning. Vienna, Austria: PMLR, 2024.
    [12] YU J W, LIU H R, YU Q J, REN J J, HAO C, DING H T, et al. ForceVLA: enhancing VLA models with a force-aware MoE for contact-rich manipulation. arXiv preprint arXiv: 2505.22159, 2025.
    [13] ZHANG C F, HAO P, CAO X G, HAO X S, CUI S W, WANG S. VTLA: vision-tactile-language-action model with preference learning for insertion manipulation. arXiv preprint arXiv: 2505.09577, 2025.
    [14] MATUSZEK C. Grounded language learning: where robotics and NLP meet. In: Proceedings of International Joint Conference on Artificial Intelligence. Stockholm, Sweden, 2018. 5687–5691.
    [15] STEPPUTTIS S, CAMPBELL J, PHIELIPP M J, LEE S, BARAL C, BEN AMOR H. Language-conditioned imitation learning for robot manipulation tasks. In: Proceedings of Neural Information Processing Systems. Virtual Event: Curran Associates, 2020. 12391–12402.
    [16] SHRIDHAR M, MANUELLI L, FOX D. CLIPort: what and where pathways for robotic manipulation. In: Proceedings of Conference on Robot Learning. Virtual Event: PMLR, 2021. 894–906.
    [17] RADFORD A, KIM J W, HALLACY C, RAMESH A, GOH G, AGARWAL S, et al. Learning transferable visual models from natural language supervision. In: Proceedings of International Conference on Machine Learning. Virtual Event: PMLR, 2021. 8787–8801.
    [18] ZENG A, FLORENCE P, THOMPSON J, WELKER S, CHIEN J, ATTARIAN M, et al. Transporter networks: rearranging the visual world for robotic manipulation. In: Proceedings of Conference on Robot Learning. Virtual Event: PMLR, 2020. 726–747.
    [19] PEREZ E, STRUB F, DE VRIES H, DUMOULIN V, COURVILLE A C. FiLM: visual reasoning with a general conditioning layer. In: Proceedings of Conference on Artificial Intelligence. Louisiana, USA: AAAI Press, 2018. 3942–3951.
    [20] JANG E, IRPAN A, KHANSARI M, KAPPLER D, EBERT F, LYNCH C, et al. BC-Z: zero-shot task generalization with robotic imitation learning. In: Proceedings of Conference on Robot Learning. Virtual Event: PMLR, 2021. 991–1002.
    [21] BROHAN A, BROWN N, CARBAJAL J, CHEBOTAR Y, DABIS J, FINN C, et al. RT-1: robotics transformer for real-world control at scale. In: Proceedings of Robotics: Science and Systems. Daegu, Korea, 2023.
    [22] TAN M X, LE Q V. EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of International Conference on Machine Learning. California, USA: PMLR, 2019. 6105–6114.
    [23] CER D, YANG Y F, KONG S Y, HUA N, LIMTIACO N, ST. JOHN R, et al. Universal sentence encoder for English. In: Proceedings of Empirical Methods in Natural Language Processing. Brussels, Belgium: ACL, 2018. 169–174.
    [24] JIANG Y F, GUPTA A, ZHANG Z C, WANG G Z, DOU Y Q, CHEN Y J, et al. VIMA: robot manipulation with multimodal prompts. In: Proceedings of International Conference on Machine Learning. Hawaii, USA: PMLR, 2023.
    [25] REED S, ZOLNA K, PARISOTTO E, GÓMEZ COLMENAREJO S, NOVIKOV A, BARTH-MARON G, et al. A generalist agent. Transactions on Machine Learning Research, 2022, 1: 1−42
    [26] ZHAO T Z, KUMAR V, LEVINE S, FINN C. Learning fine-grained bimanual manipulation with low-cost hardware. In: Proceedings of Robotics: Science and Systems. Daegu, Korea, 2023.
    [27] CHI C, FENG S Y, DU Y L, XU Z J, COUSINEAU E, BURCHFIEL B, et al. Diffusion policy: visuomotor policy learning via action diffusion. In: Proceedings of Robotics: Science and Systems. Daegu, Korea, 2023.
    [28] CHEN Y H, LI H R, ZHAO D B. Boosting continuous control with consistency policy. In: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems. Auckland, New Zealand: IFAAMAS, 2024. 335–344.
    [29] LI H R, JIANG Z N, CHEN Y H, ZHAO D B. Generalizing consistency policy to visual RL with prioritized proximal experience regularization. In: Proceedings of Neural Information Processing Systems. Vancouver, Canada: Curran Associates, 2024.
    [30] GHOSH D, WALKE H R, PERTSCH K, BLACK K, MEES O, DASARI S, et al. Octo: an open-source generalist robot policy. In: Proceedings of Robotics: Science and Systems. Delft, Netherlands, 2024.
    [31] LIU S M, WU L X, LI B G, TAN H K, CHEN H Y, WANG Z Y, et al. RDT-1B: a diffusion foundation model for bimanual manipulation. In: Proceedings of International Conference on Learning Representations. Singapore, 2025.
    [32] O'NEILL A, REHMAN A, MADDUKURI A, GUPTA A, PADALKAR A, LEE A, et al. Open X-embodiment: robotic learning datasets and RT-X models. In: Proceedings of International Conference on Robotics and Automation. Yokohama, Japan: IEEE, 2024. 6892–6903.
    [33] WANG L R, CHEN X L, ZHAO J L, HE K M. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In: Proceedings of Neural Information Processing Systems. Vancouver, Canada: Curran Associates, 2024.
    [34] MA N Y, GOLDSTEIN M, ALBERGO M S, BOFFI N M, VANDEN-EIJNDEN E, XIE S. SiT: exploring flow and diffusion-based generative models with scalable interpolant transformers. In: Proceedings of European Conference on Computer Vision. Milan, Italy: Springer, 2023. 23–40.
    [35] ZHAI X H, MUSTAFA B, KOLESNIKOV A, BEYER L. Sigmoid loss for language image pre-training. In: Proceedings of International Conference on Computer Vision. Paris, France: IEEE, 2023. 11941–11952.
    [36] DRIESS D, XIA F, SAJJADI M S M, LYNCH C, CHOWDHERY A, ICHTER B, et al. PaLM-E: an embodied multimodal language model. In: Proceedings of International Conference on Machine Learning. Hawaii, USA: PMLR, 2023.
    [37] LI X H, LIU M H, ZHANG H B, YU C J, XU J, WU H T, et al. Vision-language foundation models as effective robot imitators. In: Proceedings of International Conference on Learning Representations. Vienna, Austria, 2024.
    [38] ALAYRAC J B, DONAHUE J, LUC P, MIECH A, BARR I, HASSON Y, et al. Flamingo: a visual language model for few-shot learning. In: Proceedings of Neural Information Processing Systems. Louisiana, USA: Curran Associates, 2022.
    [39] KIM M J, PERTSCH K, KARAMCHETI S, XIAO T, BALAKRISHNA A, NAIR S, et al. OpenVLA: an open-source vision-language-action model. In: Proceedings of Conference on Robot Learning. Munich, Germany: PMLR, 2024. 2679–2713.
    [40] LI Q X, LIANG Y B, WANG Z Y, LUO L, CHEN X, LIAO M Z, et al. CogACT: a foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv: 2411.19650, 2024.
    [41] TOUVRON H, LAVRIL T, IZACARD G, MARTINET X, LACHAUX M A, LACROIX T, et al. LLaMA: open and efficient foundation language models. arXiv preprint arXiv: 2302.13971, 2023.
    [42] BLACK K, BROWN N, DRIESS D, ESMAIL A, EQUI M, FINN C, et al. π0: a vision-language-action flow model for general robot control. In: Proceedings of Robotics: Science and Systems. Los Angeles, USA, 2025.
    [43] BEYER L, STEINER A, PINTO A S, KOLESNIKOV A, WANG X, SALZ D, et al. PaliGemma: a versatile 3B VLM for transfer. arXiv preprint arXiv: 2407.07726, 2024.
    [44] LIU H P, LI X H, LI P Y, LIU M H, WANG D, LIU J R, et al. Towards generalist robot policies: what matters in building vision-language-action models. arXiv preprint arXiv: 2412.14058, 2024.
    [45] DU Y L, YANG S, DAI B, DAI H J, NACHUM O, TENENBAUM J, et al. Learning universal policies via text-guided video generation. In: Proceedings of Neural Information Processing Systems. New Orleans, USA: Curran Associates, 2023.
    [46] HU Y C, GUO Y J, WANG P C, CHEN X Y, WANG Y J, ZHANG J K, et al. Video prediction policy: a generalist robot policy with predictive visual representations. In: Proceedings of International Conference on Machine Learning. Vancouver, Canada: PMLR, 2025.
    [47] WU H T, JING Y, CHEANG C, CHEN G Z, XU J F, LI X H, et al. Unleashing large-scale video generative pre-training for visual robot manipulation. In: Proceedings of International Conference on Learning Representations. Vienna, Austria, 2024.
    [48] CHEANG C L, CHEN G Z, JING Y, KONG T, LI H, LI Y F, et al. GR-2: a generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv: 2410.06158, 2024.
    [49] YE S H, JANG J, JEON B G, JOO S J, YANG J W, PENG B L, et al. Latent action pretraining from videos. In: Proceedings of International Conference on Learning Representations. Singapore, 2025.
    [50] ZHOU Z Y, ZHU Y C, WEN J J, SHEN C M, XU Y. Vision-language-action model with open-world embodied reasoning from pretrained knowledge. arXiv preprint arXiv: 2505.21906, 2025.
    [51] BU Q W, LI H Y, CHEN L, CAI J S, ZENG J, CUI H M, et al. Towards synergistic, generalized, and efficient dual-system for robotic manipulation. arXiv preprint arXiv: 2410.08001, 2024.
    [52] SHI L X, ICHTER B, EQUI M, KE L Y, PERTSCH K, VUONG Q, et al. Hi Robot: open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv: 2502.19417, 2025.
    [53] ZHANG J K, GUO Y J, CHEN X Y, WANG Y J, HU Y C, SHI C M, et al. HiRT: enhancing robotic control with hierarchical robot transformers. In: Proceedings of Conference on Robot Learning. Munich, Germany: PMLR, 2024. 933–946.
    [54] LIU Z Y, GU Y C, ZHENG S X, XUE X Y, FU Y W. TriVLA: a unified triple-system-based unified vision-language-action model for general robot control. arXiv preprint arXiv: 2507.01424, 2025.
    [55] ZHONG Y F, HUANG X C, LI R C, ZHANG C Y, LIANG Y T, YANG Y D, et al. DexGraspVLA: a vision-language-action framework towards general dexterous grasping. arXiv preprint arXiv: 2502.20900, 2025.
    [56] CUI C, DING P X, SONG W X, BAI S H, TONG X Y, GE Z R, et al. OpenHelix: a short survey, empirical analysis, and open-source dual-system VLA model for robotic manipulation. arXiv preprint arXiv: 2505.03912, 2025.
    [57] BJORCK J, CASTAÑEDA F, CHERNIADEV N, DA X Y, DING R Y, FAN L X, et al. GR00T N1: an open foundation model for generalist humanoid robots. arXiv preprint arXiv: 2503.14734, 2025.
    [58] CHEN H, LIU J M, GU C Y, LIU Z Y, ZHANG R R, LI X Q, et al. Fast-in-Slow: a dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv: 2506.01953, 2025.
    [59] TEAM G R, ABEYRUWAN S, AINSLIE J, ALAYRAC J B, ARENAS M G, ARMSTRONG T, et al. Gemini Robotics: bringing AI into the physical world. arXiv preprint arXiv: 2503.20020, 2025.
    [60] SONG H M, QU D L, YAO Y Q, CHEN Q Z, LV Q, TANG Y W, et al. Hume: introducing system-2 thinking in visual-language-action model. arXiv preprint arXiv: 2505.21432, 2025.
    [61] ZAWALSKI M, CHEN W, PERTSCH K, MEES O, FINN C, LEVINE S. Robotic control via embodied chain-of-thought reasoning. In: Proceedings of Conference on Robot Learning. Munich, Germany: PMLR, 2024. 3157–3181.
    [62] SUN Q, HONG P F, DEEP P T, TOH V, TAN U X, GHOSAL D, et al. Emma-X: an embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning. In: Proceedings of Association for Computational Linguistics. Vienna, Austria: ACL, 2025. 14199–14214.
    [63] BU Q W, CAI J S, CHEN L, CUI X Q, DING Y, FENG S Y, et al. Agibot World Colosseo: a large-scale manipulation platform for scalable and intelligent embodied systems. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems. Hangzhou, China: IEEE, 2025.
    [64] BU Q W, YANG Y T, CAI J S, GAO S Y, REN G H, YAO M Q, et al. UniVLA: learning to act anywhere with task-centric latent actions. arXiv preprint arXiv: 2505.06111, 2025.
    [65] ZHAO Q Q, LU Y, KIM M J, FU Z P, ZHANG Z Y, WU Y C, et al. CoT-VLA: visual chain-of-thought reasoning for vision-language-action models. In: Proceedings of Computer Vision and Pattern Recognition. Nashville, USA: IEEE, 2025. 1702–1713
    [66] CEN J, YU C H, YUAN H J, JIANG Y M, HUANG S T, GUO J Y, et al. WorldVLA: towards autoregressive action world model. arXiv preprint arXiv: 2506.21539, 2025.
    [67] ZHANG W Y, LIU H S, QI Z K, WANG Y N, YU X Q, ZHANG J Z, et al. DreamVLA: a vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv: 2507.04447, 2025.
    [68] QU D L, SONG H M, CHEN Q Z, YAO Y Q, YE X Y, DING Y, et al. SpatialVLA: exploring spatial representations for visual-language-action model. In: Proceedings of Robotics: Science and Systems. California, USA, 2025.
    [69] NIU D T, SHARMA Y, XUE H R, BIAMBY G, ZHANG J Y, JI Z T, et al. Pre-training auto-regressive robotic models with 4D representations. In: Proceedings of Conference on Machine Learning. Vancouver, Canada, 2025.
    [70] SHUKOR M, AUBAKIROVA D, CAPUANO F, KOOIJMANS P, PALMA S, ZOUITINE A, et al. SmolVLA: a vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv: 2506.01844, 2025.
    [71] WANG H Y, XIONG C Y, WANG R P, CHEN X L. BitVLA: 1-bit vision-language-action models for robotics manipulation. arXiv preprint arXiv: 2506.07530, 2025.
    [72] YANG Y T, WANG Y H, WEN Z C, LUO Z W, ZOU C, ZHANG Z P, et al. EfficientVLA: training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv: 2506.10100, 2025.
    [73] BLACK K, GALLIKER M Y, LEVINE S. Real-time execution of action chunking flow policies. arXiv preprint arXiv: 2506.07339, 2025.
    [74] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, WEISSENBERN D, ZHAI X H, UNTERTHINER T, et al. An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of International Conference on Learning Representations. Vienna, Austria, 2021.
    [75] OQUAB M, DARCET T, MOUTAKANNI T, VO H V, SZAFRANIEC M, KHALIDOV V, et al. DINOv2: learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
    [76] KIM M J, FINN C, LIANG P. Fine-tuning vision-language-action models: optimizing speed and success. arXiv preprint arXiv: 2502.19645, 2025.
    [77] INTELLIGENCE P, BLACK K, BROWN N, DHARPANIAN J, DHABALIA K, DRIESS D, et al. π0.5: a vision-language-action model with open-world generalization. In: Proceedings of Conference on Robot Learning. Seoul, Korea: PMLR, 2025. 17–40
    [78] PERTSCH K, STACHOWICZ K, ICHTER B, DRIESS D, NAIR S, VUONG Q, et al. FAST: efficient action tokenization for vision-language-action models. arXiv preprint arXiv: 2501.09747, 2025.
    [79] LI C M, WEN J J, PENG Y, PENG Y X, FENG F F, ZHU Y C. PointVLA: injecting the 3D world into vision-language-action models. arXiv preprint arXiv: 2503.07511, 2025.
    [80] CUI W B, ZHAO C Y, CHEN Y H, LI H R, ZHANG Z Z, ZHAO D B, et al. CL3R: 3D reconstruction and contrastive learning for enhanced robotic manipulation representations. arXiv preprint arXiv: 2507.08262, 2025.
    [81] LI Y X, CHEN Y H, ZHOU M C, LI H R. QDepth-VLA: quantized depth prediction as auxiliary supervision for vision-language-action models. arXiv preprint arXiv: 2510.14836, 2025.
    [82] LYU J R, LI Z M, SHI X S, XU C Y, WANG Y Z, WANG H. Dywa: dynamics-adaptive world action model for generalizable non-prehensile manipulation. In: Proceedings of International Conference on Computer Vision. Hawaii, USA: IEEE, 2025.
    [83] YANG R J, CHEN G, WEN C, GAO Y. FP3: a 3D foundation policy for robotic manipulation. arXiv preprint arXiv: 2503.08950, 2025.
    [84] SINGH I, GOYAL A, BIRCHFIELD S, FOX D, GARG A, BLUKIS V. OG-VLA: 3D-aware vision language action model via orthographic image generation. arXiv preprint arXiv: 2506.01196, 2025.
    [85] LI P Y, CHEN Y X, WU H T, MA X, WU X N, HUANG Y, et al. BridgeVLA: input-output alignment for efficient 3D manipulation learning with vision-language models. arXiv preprint arXiv: 2506.07961, 2025.
    [86] JIA Y R, LIU J M, CHEN S X, GU C Y, WANG Z L, LUO L Z, et al. Lift3D Policy: lifting 2D foundation models for robust 3D robotic manipulation. In: Proceedings of Computer Vision and Pattern Recognition. Nashville, USA: IEEE, 2025. 17347–17358
    [87] GOYAL A, XU J, GUO Y J, BLUKIS V, CHAO Y W, FOX D. RVT: robotic view transformer for 3D object manipulation. In: Proceedings of Conference on Robot Learning. Atlanta, USA: PMLR, 2023. 694–710
    [88] GOYAL A, BLUKIS V, XU J, GUO Y J, CHAO Y W, FOX D. RVT-2: learning precise manipulation from few demonstrations. In: Proceedings of Robotics: Science and Systems. Delft, Netherlands, 2024.
    [89] HAO P, ZHANG C F, LI D Z, CAO X G, HAO X S, CUI S W, et al. TLA: tactile-language-action model for contact-rich manipulation. arXiv preprint arXiv: 2503.08548, 2025.
    [90] HUANG J L, WANG S, LIN F Q, HU Y H, WEN C, GAO Y. Tactile-VLA: unlocking vision-language-action model's physical knowledge for tactile generalization. arXiv preprint arXiv: 2507.09160, 2025.
    [91] SUN Y H, CHENG N, ZHANG S X, LI W Z, YANG L Y, CUI S W, et al. Tactile data generation and applications based on visuo-tactile sensors: a review. Information Fusion, 2025, 121(1): 103162
    [92] FENG R X, HU J Y, XIA W K, GAO T C, SHEN A, SUN Y H, et al. AnyTouch: learning unified static-dynamic representation across multiple visuo-tactile sensors. In: Proceedings of International Conference on Learning Representations. Singapore, 2025.
    [93] ZHOU Z Y, ZHU Y C, ZHU M J, WEN J J, LIU N, XU Z Y, et al. ChatVLA: unified multimodal understanding and robot control with vision-language-action model. arXiv preprint arXiv: 2502.14420, 2025.
    [94] ZHOU Z Y, ZHU Y C, WEN J J, SHEN C M, XU Y. Vision-language-action model with open-world embodied reasoning from pretrained knowledge. arXiv preprint arXiv: 2505.21906, 2025.
    [95] GU A, DAO T. Mamba: linear-time sequence modeling with selective state spaces. In: Proceedings of Conference on Language Modeling. Philadelphia, USA, 2024.
    [96] LIU J M, LIU M Z, WANG Z Y, AN P J, LI X Q, ZHOU K C, et al. RoboMamba: efficient vision-language-action model for robotic reasoning and manipulation. In: Proceedings of Neural Information Processing Systems. Vancouver, Canada: Curran Associates, 2024.
    [97] SHAFIULLAH N M, CUI Z J, ALTANZAYA A, PINTO L. Behavior transformers: cloning k modes with one stone. In: Proceedings of Neural Information Processing Systems. Louisiana, USA: Curran Associates, 2022.
    [98] LEE S J, WANG Y B, ETUKURU H, KIM H J, SHAFIULLAH N M, PINTO L. Behavior generation with latent actions. In: Proceedings of International Conference on Machine Learning. Vienna, Austria: PMLR, 2024.
    [99] LIU J M, CHEN H, AN P J, LIU Z Y, ZHANG R R, GU C Y, et al. HybridVLA: collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv: 2503.10631, 2025.
    [100] DRIESS D, SPRINGENBERG J T, ICHTER B, YU L L, LI-BELL A, PERTSCH K, et al. Knowledge insulating vision-language-action models: train fast, run fast, generalize better. arXiv preprint arXiv: 2505.23705, 2025.
    [101] LIN T Y, MAIRE M, BELONGIE S J, HAYS J, PERONA P, RAMANAN D, et al. Microsoft COCO: common objects in context. In: Proceedings of European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 740–755
    [102] YU Q Y, SUN Q, ZHANG X S, CUI Y F, ZHANG F, CAO Y, et al. CapsFusion: rethinking image-text data at scale. In: Proceedings of Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2024. 14022–14032
    [103] JIA Z H, ZHANG Z C, QIAN J Y, WU H N, SUN W, LI C Y, et al. VQA2: visual question answering for video quality assessment. arXiv preprint arXiv: 2411.03795, 2024.
    [104] SINGH A, NATARAJAN V, SHAH M, JIANG Y, CHEN X L, BATRA D, et al. Towards VQA models that can read. In: Proceedings of Computer Vision and Pattern Recognition. Long Beach, CA: IEEE, 2019. 8317–8326
    [105] HUDSON D A, MANNING C D. GQA: a new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of Computer Vision and Pattern Recognition. Long Beach, CA: IEEE, 2019. 6700–6709
    [106] GOYAL R, KAHOU S E, MICHALSKI V, MATERZYNSKA J, WESTPHAL S, KIM H, et al. The "something something" video database for learning and evaluating visual common sense. In: Proceedings of International Conference on Computer Vision. Venice, Italy: IEEE, 2017. 5843–5851
    [107] DAMEN D, DOUGHTY H, FARINELLA G M, FIDLER S, FURNARI A, KAZAKOS E, et al. Scaling egocentric vision: the epic-kitchens dataset. In: Proceedings of European Conference on Computer Vision. Munich, Germany: Springer, 2018. 720–736
    [108] GRAUMAN K, WESTBURY A, BYRNE E, CHAVIS Z, FURNARI A, GIRDHAR R, et al. Ego4D: around the world in 3,000 hours of egocentric video. In: Proceedings of Computer Vision and Pattern Recognition. New Orleans, USA: IEEE, 2022. 18973–18990
    [109] GRAUMAN K, WESTBURY A, TORRESANI L, KITANI K, MALIK J, AFOURAS T, et al. Ego-Exo4D: understanding skilled human activity from first- and third-person perspectives. In: Proceedings of Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2024. 19383–19400
    [110] YANG R H, YU Q X, WU Y C, YAN R, LI B R, CHENG A C, et al. EgoVLA: learning vision-language-action models from egocentric human videos. arXiv preprint arXiv: 2507.12440, 2025.
    [111] LUO H, FENG Y C, ZHANG W P, ZHENG S P, WANG Y, YUAN H Q, et al. Being-H0: vision-language-action pretraining from large-scale human videos. arXiv preprint arXiv: 2507.15597, 2025.
    [112] LIU X, CHEN Y R, LI H R. Sample-efficient unsupervised policy cloning from ensemble self-supervised labeled videos. In: Proceedings of International Conference on Robotics and Automation. Atlanta, USA: IEEE, 2025. 3632–3639
    [113] NASIRIANY S, MADDUKURI A, ZHANG L, PARIKH A, LO A, JOSHI A, et al. RoboCasa: large-scale simulation of everyday tasks for generalist robots. In: Proceedings of Robotics: Science and Systems. Delft, Netherlands, 2024.
    [114] DENG S L, YAN M, WEI S L, MA H X, YANG Y X, CHEN J Y, et al. GraspVLA: a grasping foundation model pre-trained on billion-scale synthetic action data. In: Proceedings of Conference on Robot Learning. Seoul, Korea: PMLR, 2025.
    [115] CHEN T X, CHEN Z X, CHEN B J, CAI Z J, LIU Y B, LIANG Q W, et al. RoboTwin 2.0: a scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv: 2506.18088, 2025.
    [116] JIANG Z Y, XIE Y Q, LIN K, XU Z J, WAN W K, MANDLEKAR A, et al. DexMimicGen: automated data generation for bimanual dexterous manipulation via imitation learning. In: Proceedings of International Conference on Robotics and Automation. Atlanta, USA: IEEE, 2025. 16923–16930
    [117] LIU W H, WAN Y X, WANG J L, KUANG Y X, SHI X S, LI H R, et al. FetchBot: object fetching in cluttered shelves via zero-shot sim2real. In: Proceedings of Conference on Robot Learning. Seoul, Korea: PMLR, 2025. 2165–2183
    [118] EBERT F, YANG Y L, SCHMECKPEPER K, BUCHER B, GEORGAKIS G, DANIILIDIS K, et al. Bridge Data: boosting generalization of robotic skills with cross-domain datasets. In: Proceedings of Robotics: Science and Systems. New York, USA, 2022.
    [119] WALKE H R, BLACK K, ZHAO T Z, VUONG Q, ZHENG C Y, HANSEN-ESTRUCH P, et al. Bridge Data V2: a dataset for robot learning at scale. In: Proceedings of Conference on Robot Learning. Atlanta, USA: PMLR, 2023. 1723–1736
    [120] KHAZATSKY A, PERTSCH K, NAIR S, BALAKRISHNA A, DASARI S, KARAMCHETI S, et al. DROID: a large-scale in-the-wild robot manipulation dataset. In: Proceedings of Robotics: Science and Systems. Delft, Netherlands, 2024.
    [121] SCHUHMANN C, KACZMARCZYK R, KOMATSUZAKI A, KATTA A, VENCU R, BEAUMONT R, et al. LAION-400M: open dataset of CLIP-filtered 400 million image-text pairs. In: Proceedings of NeurIPS Workshop on Datacentric AI. 2021.
    [122] DEITKE M, CLARK C, LEE S, TRIPATHI R, YANG Y, PARK J S, et al. Molmo and PixMo: open weights and open data for state-of-the-art vision-language models. In: Proceedings of Computer Vision and Pattern Recognition. Nashville, USA: IEEE, 2025. 91–104
    [123] WANG X, ALABDULMOHSIN I, SALZ D, LI Z, RONG K, ZHAI X H. Scaling pre-training to one hundred billion data for vision language models. arXiv preprint arXiv: 2502.07617, 2025.
    [124] YANG J W, TAN R, WU Q H, ZHENG R J, PENG B L, LIANG Y Y, et al. Magma: a foundation model for multimodal AI agents. In: Proceedings of Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2024. 14203–14214
    [125] CARREIRA J, NOLAND E, HILLIER C, ZISSERMAN A. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv: 1907.06987, 2019.
    [126] ZHENG R J, LIANG Y Y, HUANG S Y, GAO J F, DAUMÉ H III, KOLOBOV A, et al. TraceVLA: visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. In: Proceedings of International Conference on Learning Representations. Singapore, 2025.
    [127] DUAN Z K, ZHANG Y, GENG S K, LIU G W, BOEDECKER J, LU C X. Fast ECoT: efficient embodied chain-of-thought via thoughts reuse. arXiv preprint arXiv: 2506.07639, 2025.
    [128] HUNG C Y, SUN Q, HONG P F, ZADEH A, LI C, TAN U, et al. NORA: a small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv: 2504.19854, 2025.
    [129] WEN J J, ZHU Y C, ZHU M J, TANG Z B, LI J M, ZHOU Z Y, et al. DiffusionVLA: a scaling robot foundation models via unified diffusion and autoregression. In: Proceedings of International Conference on Machine Learning. Vancouver, Canada: PMLR, 2025.
    [130] ZHENG J L, LI J X, LIU D X, ZHENG Y N, WANG Z H, OU Z H, et al. Universal actions for enhanced embodied foundation models. In: Proceedings of Computer Vision and Pattern Recognition. Nashville, USA: IEEE, 2025. 22508–22519
    [131] FU Z P, ZHAO T Z, FINN C. Mobile ALOHA: learning bimanual mobile manipulation using low-cost whole-body teleoperation. In: Proceedings of Conference on Robot Learning. Seoul, Korea: PMLR, 2024. 4066–4083
    [132] FANG H S, FANG H J, TANG Z Y, LIU J R, WANG C X, WANG J B, et al. RH20T: a comprehensive robotic dataset for learning diverse skills in one-shot. In: Proceedings of International Conference on Robotics and Automation. Yokohama, Japan: IEEE, 2024. 653–660
    [133] KUMAR V, SHAH R M, ZHOU G Y, MOENS V, CAGGIANO V, GUPTA A, et al. RoboHive: a unified framework for robot learning. In: Proceedings of Neural Information Processing Systems. New Orleans, USA: Curran Associates, 2023.
    [134] WU K, HOU C K, LIU J M, CHE Z P, JU X Z, YANG Z Q, et al. RoboMIND: benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv: 2412.13877, 2024.
    [135] BHARADHWAJ H, DWIBEDI D, GUPTA A, TULSIANI S, DOERSCH C, XIAO T, et al. Gen2Act: human video generation in novel scenarios enables generalizable robot manipulation. In: Proceedings of Conference on Robot Learning. Munich, Germany: PMLR, 2024.
    [136] WANG Z Q, ZHENG H, NIE Y S, XU W J, WANG Q W, YE H, et al. All robots in one: a new standard and unified dataset for versatile, general-purpose embodied agents. arXiv preprint arXiv: 2408.10899, 2024.
    [137] ESSER P, ROMBACH R, OMMER B. Taming transformers for high-resolution image synthesis. In: Proceedings of Computer Vision and Pattern Recognition. 2021. 12873–12883
    [138] CHEANG C, CHEN S J, CUI Z R, HU Y D, HUANG L Q, KONG T, et al. GR-3 technical report. arXiv preprint arXiv: 2507.15493, 2025.
    [139] CLARK J, MIRCHANDANI S, SADIGH D, BELKHALE S. Action-free reasoning for policy generalization. In: Proceedings of ICRA 2025 Workshop on Foundation Models and Neuro-Symbolic AI for Robotics. Atlanta, USA, 2025.
    [140] TANG W L, JING D, PAN J H, LU Z W, LIU Y H, LI L E, et al. Incentivizing multimodal reasoning in large models for direct robot manipulation. arXiv preprint arXiv: 2505.12744, 2025.
    [141] ZHANG J, WU S H, LUO X, WU H, GAO L L, SHEN H T, et al. InSpire: vision-language-action models with intrinsic spatial reasoning. arXiv preprint arXiv: 2505.13888, 2025.
    [142] LIN F Q, NAI R Q, HU Y D, YOU J C, ZHAO J M, GAO Y. OneTwoVLA: a unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv: 2505.11917, 2025.
    [143] CHEN W, BELKHALE S, MIRCHANDANI S, MEES O, DRIESS D, PERTSCH K, et al. Training strategies for efficient embodied reasoning. arXiv preprint arXiv: 2505.08243, 2025.
    [144] KUMAR K, ASHRAF T, THAWAKAR O, ANWER R M, CHOLAKKAL H, SHAH M, et al. LLM post-training: a deep dive into reasoning large language models. arXiv preprint arXiv: 2502.21321, 2025.
    [145] CHEN Y H, TIAN S, LIU S G, ZHOU Y T, LI H R, ZHAO D B. ConRFT: a reinforced fine-tuning method for VLA models via consistency policy. In: Proceedings of Robotics: Science and Systems. Los Angeles, USA, 2025.
    [146] ZHANG Z J, ZHENG K Y, CHEN Z R, JANG J, LI Y, HAN S W, et al. GRAPE: generalizing robot policy via preference alignment. In: Proceedings of ICRA 2025 Workshop on Foundation Models and Neuro-Symbolic AI for Robotics. Atlanta, USA, 2025.
    [147] GUO Y J, ZHANG J K, CHEN X Y, JI X, WANG Y J, HU Y C, et al. Improving vision-language-action model with online reinforcement learning. In: Proceedings of International Conference on Robotics and Automation. Atlanta, USA: IEEE, 2025.
    [148] MARK M S, GAO T, SAMPAIO G G, SRIRAMA M K, SHARMA A, FINN C, et al. Policy-agnostic RL: offline RL and online RL fine-tuning of any class and backbone. In: Proceedings of The 7th Robot Learning Workshop at ICLR 2025. Singapore, 2025.
    [149] YUAN X, MU T Z, TAO S, FANG Y H, ZHANG M K, SU H. Policy decorator: model-agnostic online refinement for large policy model. In: Proceedings of International Conference on Learning Representations. Singapore, 2025.
    [150] ZHANG H Y, ZHUANG Z F, ZHAO H, DING P X, LU H C, WANG D L. ReinboT: amplifying robot visual-language manipulation with reinforcement learning. In: Proceedings of International Conference on Machine Learning. Vancouver, Canada: PMLR, 2025.
    [151] TAN S H, DOU K R, ZHAO Y, KRAEHENBUEHL P. Interactive post-training for vision-language-action models. In: Proceedings of Workshop on Foundation Models Meet Embodied Agents at CVPR 2025. Denver, USA, 2025.
    [152] XU C, LI Q Y, LUO J L, LEVINE S. RLDG: robotic generalist policy distillation via reinforcement learning. In: Proceedings of Robotics: Science and Systems. Los Angeles, USA, 2025.
    [153] CHEN Z J, NIU R L, KONG H, WANG Q. TGRPO: fine-tuning vision-language-action model via trajectory-wise group relative policy optimization. arXiv preprint arXiv: 2506.08440, 2025.
    [154] LU G X, GUO W K, ZHANG C B, ZHOU Y H, JIANG H N, GAO Z F, et al. VLA-RL: towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv: 2505.18719, 2025.
    [155] WU Y L, TIAN R, SWAMY G, BAJCSY A. From foresight to forethought: VLM-in-the-loop policy steering via latent alignment. In: Proceedings of Robotics: Science and Systems. Los Angeles, USA, 2025.
    [156] WANG Y W, WANG L R, DU Y L, SUNDARALINGAM B, YANG X N, CHAO Y W, et al. Inference-time policy steering through human interactions. In: Proceedings of International Conference on Robotics and Automation. Atlanta, USA: IEEE, 2025. 15626–15633
    [157] KWOK J, AGIA C, SINHA R, FOUTTER M, LI S L, STOICA I, et al. RoboMonkey: scaling test-time sampling and verification for vision-language-action models. In: Proceedings of Second Workshop on Out-of-Distribution Generalization in Robotics at RSS 2025. Los Angeles, USA, 2025.
    [158] NAKAMOTO M, MEES O, KUMAR A, LEVINE S. Steering your generalists: improving robotic foundation models via value guidance. In: Proceedings of Conference on Robot Learning. Munich, Germany: PMLR, 2024. 4996–5013
    [159] KHAZATSKY A, PERTSCH K, NAIR S, BALAKRISHNA A, DASARI S, KARAMCHETI S, et al. DROID: a large-scale in-the-wild robot manipulation dataset. In: Proceedings of Robotics: Science and Systems. Delft, Netherlands, 2024.
    [160] GUO D Y, YANG D J, ZHANG H W, SONG J X, WANG P Y, ZHU Q H, et al. DeepSeek-R1: incentivizing reasoning capability in llms via reinforcement learning. Nature, 2025, 645(8081): 633−638 doi: 10.1038/s41586-025-09422-z
    [161] LIU J J, GAO F, WEI B W, CHEN X L, LIAO Q M, WU Y, et al. What can RL bring to VLA generalization? An empirical study. arXiv preprint arXiv: 2505.19789, 2025.
    [162] ZHENG J L, LI J X, WANG Z H, LIU D X, KANG X R, FENG Y C, et al. X-VLA: soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv: 2510.10274, 2025.
    [163] IRVING F, ZHANG J X, TONG S B, FENG C. From intention to execution: probing the generalization boundaries of vision-language-action models. arXiv preprint arXiv: 2506.09930, 2025.
    [164] JANGIR Y, ZHANG Y D, YAMAZAKI K, ZHANG C Y, TU K H, KE T W, et al. RobotArena ∞: unlimited robot benchmarking via real-to-sim translation. arXiv preprint arXiv: 2510.23571, 2025.
    [165] GAO J, BELKHALE S, DASARI S, BALAKRISHNA A, SHAH D, SADIGH D. A taxonomy for evaluating generalist robot policies. arXiv preprint arXiv: 2503.01238, 2025.
    [166] ZHOU J M, YE K, LIU J Y, MA T L, WANG Z F, QIU R H, et al. Exploring the limits of vision-language-action manipulations in cross-task generalization. arXiv preprint arXiv: 2505.15660, 2025.
    [167] WANG Z J, ZHOU Z H, SONG J Y, HUANG Y H, SHU Z, MA L. VLATest: testing and evaluating vision-language-action models for robotic manipulation. Proceedings of the ACM on Software Engineering, 2025, 2(1): 1615−1638
    [168] ATREYA P, PERTSCH K, LEE T, KIM M J, JAIN A, KURAMSHIN A, et al. RoboArena: distributed real-world evaluation of generalist robot policies. In: Proceedings of Conference on Robot Learning. Seoul, Korea: PMLR, 2025.
    [169] WANG Y R, UNG C, TANNERT G, DUAN J F, LI J, LE A, et al. RoboEval: where robotic manipulation meets structured and scalable evaluation. arXiv preprint arXiv: 2507.00435, 2025.
    [170] LUO J L, XU C, LIU F C, TAN L, LIN Z P, WU J, et al. FMB: a functional manipulation benchmark for generalizable robotic learning. International Journal of Robotics Research, 2025, 44(4): 592−606 doi: 10.1177/02783649241276017
    [171] ZHOU Z Y, ATREYA P, TAN Y L, PERTSCH K, LEVINE S. AutoEval: autonomous evaluation of generalist robot manipulation policies in the real world. In: Proceedings of 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities. Singapore, 2025.
    [172] MEES O, HERMANN L, ROSETE-BEAS E, BURGARD W. CALVIN: a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 2022, 7(3): 7327−7334 doi: 10.1109/LRA.2022.3180108
    [173] GUPTA A, KUMAR V, LYNCH C, LEVINE S, HAUSMAN K. Relay policy learning: solving long-horizon tasks via imitation and reinforcement learning. In: Proceedings of Conference on Robot Learning. Osaka, Japan: PMLR, 2019. 1025–1037
    [174] LI X L, HSU K, GU J Y, MEES O, PERTSCH K, WALKE H R, et al. Evaluating real-world robot manipulation policies in simulation. In: Proceedings of Conference on Robot Learning. Munich, Germany: PMLR, 2024. 3705–3728
    [175] LIU B, ZHU Y F, GAO C K, FENG Y H, LIU Q, ZHU Y K, et al. LIBERO: benchmarking knowledge transfer for lifelong robot learning. In: Proceedings of Neural Information Processing Systems. New Orleans, USA: Curran Associates, 2023.
    [176] GAO C K, ZHANG H Z, XU Z X, CAI Z H, LIN S. FLIP: flow-centric generative planning as general-purpose manipulation world model. In: Proceedings of International Conference on Learning Representations. Singapore, 2025.
    [177] LI Y, MENG Y, SUN Z W, JI K Y, TANG C, FAN J J, et al. SP-VLA: a joint model scheduling and token pruning approach for VLA model acceleration. arXiv preprint arXiv: 2506.12723, 2025.
    [178] YU T H, QUILLEN D, HE Z P, JULIAN R, HAUSMAN K, FINN C, et al. Meta-World: a benchmark and evaluation for multi-task and meta reinforcement learning. In: Proceedings of Conference on Robot Learning. Osaka, Japan: PMLR, 2019. 1094–1100
    [179] WEN J J, ZHU Y C, LI J M, ZHU M J, TANG Z B, WU K, et al. TinyVLA: toward fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025, 10(4): 3988−3995 doi: 10.1109/LRA.2025.3544909
    [180] JAMES S, MA Z C, ARROJO D R, DAVISON A J. RLBench: the robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 2020, 5(2): 3019−3026 doi: 10.1109/LRA.2020.2974707
    [181] YAN F, LIU F F, ZHENG L M, ZHONG Y F, HUANG Y Y, GUAN Z C, et al. Robomm: all-in-one multimodal large model for robotic manipulation. arXiv preprint arXiv: 2412.07215, 2024.
    [182] MANDLEKAR A, XU D F, WONG J, NASIRIANY S, WANG C, KULKARNI R, et al. What matters in learning from offline human demonstrations for robot manipulation. In: Proceedings of Conference on Robot Learning. London, UK: PMLR, 2021. 1678–1690
    [183] FANG I, ZHANG J X, TONG S B, FENG C. From intention to execution: probing the generalization boundaries of vision-language-action models. arXiv preprint arXiv: 2506.09930, 2025.
    [184] GU J Y, XIANG F B, LI X L, LING Z, LIU X Q, MU T Z, et al. ManiSkill2: a unified benchmark for generalizable manipulation skills. In: Proceedings of International Conference on Learning Representations. Kigali, Rwanda, 2023.
    [185] HO D, MONAS J, REN J T, YU C. 1X World Model: evaluating bits, not atoms. [Online]. Available: https://www.1x.tech/1x-world-model.pdf, Nov. 6, 2025
    [186] HUANG S Q, WU J L, ZHOU Q X, MIAO S C, LONG M S. Vid2World: crafting video diffusion models to interactive world models. arXiv preprint arXiv: 2505.14357, 2025.
    [187] LI Y X, ZHU Y C, WEN J J, SHEN C M, XU Y. WorldEval: world model as real-world robot policies evaluator. arXiv preprint arXiv: 2505.19017, 2025.
    [188] JIANG Y M, HUANG S T, XUE S K, ZHAO Y X, CEN J, LENG S C, et al. RynnVLA-001: using human demonstrations to improve robot manipulation. arXiv preprint arXiv: 2509.15212, 2025.
    [189] WEN J J, ZHU Y C, LI J M, TANG Z B, SHEN C M, FENG F F. DexVLA: vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv: 2502.05855, 2025.
    [190] JIANG T, YUAN T Y, LIU Y C, LU C H, CUI J N, LIU X, et al. Galaxea open-world dataset and G0 dual-system VLA model. arXiv preprint arXiv: 2509.00576, 2025.
    [191] GAO G, WANG J N, ZUO J B, JIANG J N, ZHANG J F, ZENG X W, et al. Towards human-level intelligence via human-like whole-body manipulation. arXiv preprint arXiv: 2507.17141, 2025.
    [192] LI P H, WU Y Y, XI Z H, LI W L, HUANG Y Z, ZHANG Z Y, et al. ControlVLA: few-shot object-centric adaptation for pre-trained vision-language-action models. In: Proceedings of Conference on Robot Learning. Seoul, Korea: PMLR, 2025. 2165–2183
    [193] ZHAI A, LIU B, FANG B, CAI C, MA E, YIN E, et al. Igniting VLMs toward the embodied space. arXiv preprint arXiv: 2509.11766, 2025.
出版历程
  • 收稿日期:  2025-08-25
  • 录用日期:  2025-11-06
  • 网络出版日期:  2025-12-15
