基于梯度损失的离线强化学习算法

陈鹏宇 刘士荣 段帅 端军红 刘扬

引用本文: 陈鹏宇, 刘士荣, 段帅, 端军红, 刘扬. 基于梯度损失的离线强化学习算法. 自动化学报, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240481
Citation: Chen Peng-Yu, Liu Shi-Rong, Duan Shuai, Duan Jun-Hong, Liu Yang. Gradient loss for offline reinforcement learning. Acta Automatica Sinica, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240481

基于梯度损失的离线强化学习算法

doi: 10.16383/j.aas.c240481 cstr: 32138.14.j.aas.c240481
基金项目: 国家自然科学基金(62071154, 62173340)资助
详细信息
    作者简介:

    陈鹏宇:哈尔滨工业大学计算学部博士研究生. 2024年获得哈尔滨工业大学人工智能学士学位. 主要研究方向为强化学习. E-mail: chenpengyu02@foxmail.com

    刘士荣:哈尔滨工业大学计算学部博士研究生. 2023年获得哈尔滨工业大学计算机科学与技术硕士学位. 主要研究方向为离线强化学习. 本文通信作者. E-mail: shirongliu16@gmail.com

    段帅:哈尔滨工业大学计算学部硕士研究生. 2022年获得哈尔滨工业大学计算学部学士学位. 主要研究方向为多智能体与强化学习. E-mail: 18845773665@163.com

    端军红:空军工程大学防空反导学院副教授. 长期从事导弹智能博弈制导技术, 火力控制技术, 机器学习及其应用的研究工作. E-mail: duanjunhonggx@163.com

    刘扬:哈尔滨工业大学计算学部教授. 长期从事强化学习与多智能体、生成模型与分子设计的研究工作. E-mail: liuyang@hit.edu.cn

Gradient Loss for Offline Reinforcement Learning

Funds: Supported by National Natural Science Foundation of China (62071154, 62173340)
More Information
    Author Bio:

    CHEN Peng-Yu Ph.D. candidate at the Faculty of Computing, Harbin Institute of Technology. He received his bachelor's degree in artificial intelligence from Harbin Institute of Technology in 2024. His main research interest is reinforcement learning

    LIU Shi-Rong Ph.D. candidate at the Faculty of Computing, Harbin Institute of Technology. He received his master's degree in computer science and technology from Harbin Institute of Technology in 2023. His main research interest is offline reinforcement learning. Corresponding author of this paper

    DUAN Shuai Master's student at the Faculty of Computing, Harbin Institute of Technology. He received his bachelor's degree from Harbin Institute of Technology in 2022. His main research interests are multi-agent systems and reinforcement learning

    DUAN Jun-Hong Associate professor at the Air and Missile Defense College, Air Force Engineering University. His research interests cover intelligent missile guidance, fire control technology, and machine learning and its applications

    LIU Yang Professor at the Faculty of Computing, Harbin Institute of Technology. His research interests cover reinforcement learning and multi-agent systems, generative models, and molecular design

  • 摘要: 离线强化学习面临的核心挑战在于如何避免分布偏移并缓解值函数的过估计问题. 传统的TD3+BC算法通过引入行为克隆正则项, 有效地约束习得策略使其接近行为策略, 从而获得了有竞争力的性能, 但其策略在训练过程中的稳定性仍有待提高. 在现实世界中, 策略验证往往成本高昂, 因此提高策略稳定性尤为关键. 本研究受深度学习中“平坦最小值”概念的启发, 探索目标策略损失函数在动作空间中的平坦区域, 以获得稳定策略. 为此, 提出一种梯度损失函数 (Gradient Loss Function, GLF), 并基于此设计一种新的离线强化学习算法, 即梯度损失离线强化学习算法 (Gradient Loss for Offline Reinforcement Learning, GLO). 在D4RL基准数据集上的实验结果表明, GLO算法在性能上超越了当前的主流算法. 此外, 还将该方法扩展到在线强化学习领域, 实验结果证明了其在在线强化学习环境下的普适性和有效性.
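下面给出一个按摘要描述整理的最小示意 (仅为理解性草稿, 并非论文官方实现): 在 TD3+BC 风格的策略损失之上, 对损失关于动作的梯度范数施加惩罚, 以引导策略落入动作空间中的平坦区域. 其中系数 alpha、lam 以及 actor/critic 接口均为假设.

```python
import torch

def glf_policy_loss(actor, critic, states, dataset_actions, alpha=2.5, lam=0.05):
    """TD3+BC 风格策略损失 + 动作梯度范数惩罚 (GLF 思想示意, 非官方实现)."""
    actions = actor(states)                      # 目标策略给出的动作
    q = critic(states, actions)
    lmbda = alpha / q.abs().mean().detach()      # TD3+BC 中对 Q 值尺度的归一化
    base_loss = -lmbda * q.mean() + ((actions - dataset_actions) ** 2).mean()
    # 对动作求损失梯度; create_graph=True 使惩罚项本身可继续反向传播
    grad = torch.autograd.grad(base_loss, actions, create_graph=True)[0]
    return base_loss + lam * grad.norm(p=2, dim=-1).mean()
```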
  • 图  1  hopper-medium环境下的标准化奖励分布直方图

    Fig.  1  Histogram of normalized reward distribution for hopper-medium environment

    图  2  $\lambda$对算法表现的影响

    Fig.  2  The effect of $\lambda$ on algorithm performance

    (多面板图A, 面板A1−A11依次为: halfcheetah-medium, halfcheetah-medium-replay, halfcheetah-medium-expert, walker2d-medium, walker2d-medium-replay, walker2d-medium-expert, hopper-medium, hopper-medium-replay, hopper-medium-expert, antmaze-umaze, antmaze-umaze-diverse)

    (多面板图B, 面板B1−B9依次为: halfcheetah-medium, halfcheetah-medium-replay, halfcheetah-medium-expert, walker2d-medium, walker2d-medium-replay, walker2d-medium-expert, hopper-medium, hopper-medium-replay, hopper-medium-expert)

    表  1  实验中的超参数设置

    Table  1  Hyperparameters in the experiment

    超参数名称 取值 描述
    $\gamma$ 0.99 折扣因子
    max_timesteps 1e6 训练轮数
    policy_freq 2 每过几轮更新策略
    Q函数网络 FC(256,256,256) 全连接层 (Fully connected layers, FC), 加入标准化层, 激活函数为ReLU
    BC网络、残差策略网络 FC(256,256,256) 全连接层, 激活函数为ReLU
    学习率 0.001 策略网络和Q函数网络学习率
    优化器 Adam 策略网络和Q函数网络优化器
    $\tau_q$ 见表2 Q函数网络更新系数
    $\tau_{action}$ 见表2 策略网络更新系数
    $\lambda$ 见表2 梯度损失函数中梯度的系数
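根据表1的网络配置, 可以写出如下示意代码 (标准化层取 LayerNorm 且置于每个隐藏层之后, 这一细节为假设, 以论文实现为准):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """表1所述 Q 函数网络示意: FC(256,256,256) + 标准化层 + ReLU."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                 # 输出标量 Q 值
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```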

    表  2  实验中更新系数$ \tau $的设置

    Table  2  Update rate $ \tau $ settings in the experiment

    环境 $\tau_q$ $\tau_{action}$ $\lambda$
    halfcheetah-m 5e-3 5e-3 1.5e-1
    halfcheetah-m-r 2e-3 1e-5 5e-2
    halfcheetah-m-e 1e-3 1e-5 5e-3
    walker2d-m 2e-3 1e-6 1e-1
    walker2d-m-r 4e-4 1e-6 3e-2
    walker2d-m-e 2e-3 2e-3 4e-2
    hopper-m 2e-3 5e-7 5e-2
    hopper-m-r 1e-3 1e-3 2e-2
    hopper-m-e 2e-3 1e-5 1e-2
    antmaze-umaze 2e-3 2e-3 1e-3
    antmaze-umaze-diverse 5e-3 5e-3 1e1
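表2中的更新系数 $\tau$ 通常用于目标网络的软更新 (Polyak 平均), 下面是通用写法的示意 (非论文原始代码):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau):
    """target <- (1 - tau) * target + tau * online."""
    for tp, p in zip(target_net.parameters(), online_net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```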

    表  3  GLO和主流算法得到的标准化奖励

    Table  3  Normalized rewards achieved by GLO and mainstream algorithms

    环境 TD3+BC IQL CQL Diffusion-QL SPOT wPC GLO
    halfcheetah-m 48.3±0.3 48.3±0.2 47.0±0.2 51.1±0.5 58.4±1.0 53.3 57.1±0.4
    halfcheetah-m-r 44.6±0.5 44.5±0.2 45.0±0.3 47.8±0.3 52.2±1.2 48.3 48.0±0.3
    halfcheetah-m-e 90.7±4.3 94.7±0.5 95.6±0.4 96.8±0.3 86.9±4.3 93.7 96.2±0.9
    walker2d-m 83.7±2.1 80.9±3.2 80.7±3.3 87.0±0.9 86.4±2.7 86.0 90.2±1.3
    walker2d-m-r 81.8±5.5 82.2±3.0 73.1±13.2 95.5±1.5 91.6±2.8 89.9 89.6±1.4
    walker2d-m-e 110.1±0.5 111.7±0.9 109.6±0.4 110.1±0.3 112.1±0.5 110.1 111.8±1.2
    hopper-m 59.3±4.2 67.5±3.8 59.1±3.8 90.5±4.6 86.0±8.7 86.5 94.1±1.7
    hopper-m-r 60.9±18.8 97.4±6.4 95.1±5.3 101.3±0.6 100.2±1.9 97.0 99.2±1.6
    hopper-m-e 98.4±9.4 107.4±7.8 99.3±10.9 111.1±1.3 99.3±7.1 95.7 111.7±1.4
    平均 75.3 81.6 78.3 87.9 85.9 84.5 88.7
    antmaze-umaze-v2 70.8±39.2 77.0±5.5 92.8±1.2 93.4±3.4 93.5±2.4 — 93.8±1.8
    antmaze-umaze-diverse-v2 44.8±11.6 54.3±5.5 37.3±3.7 66.2±8.6 40.7±5.1 — 51.8±5.3
    平均 55.0 65.7 65.0 79.8 67.1 — 72.8
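表3、表6等处的标准化奖励按 D4RL 基准的惯例计算: 随机策略对应 0 分, 专家策略对应 100 分, 示意如下:

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """D4RL 标准化奖励: 100 * (回报 - 随机策略回报) / (专家回报 - 随机策略回报)."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)
```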

    表  4  不同算法对应的梯度范数

    Table  4  The gradient norm of different algorithm

    环境 TD3+BC GLO
    halfcheetah-m 2.0e-4 5.0e-5
    halfcheetah-m-r 4.0e-4 1.2e-4
    halfcheetah-m-e 2.2e-4 1.1e-4
    walker2d-m 2.2e-4 8.7e-5
    walker2d-m-r 4.1e-4 7.3e-4
    walker2d-m-e 2.1e-4 3.0e-5
    hopper-m 3.5e-4 5.2e-5
    hopper-m-r 5.7e-4 1.0e-4
    hopper-m-e 3.8e-4 3.0e-5
    antmaze-umaze 1.6e-3 5.4e-4
    antmaze-umaze-diverse 6.2e-4 2.9e-4
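表4中的梯度范数可按如下方式测量 (示意代码, 其中 loss_fn 为假设的策略损失接口): 对一批状态计算策略损失关于动作的梯度, 再取其 L2 范数的批均值.

```python
import torch

def action_gradient_norm(actor, states, loss_fn):
    """测量策略损失在动作处的梯度范数 (示意)."""
    actions = actor(states)
    loss = loss_fn(states, actions)               # 标量策略损失
    grad = torch.autograd.grad(loss, actions)[0]  # 损失对动作的梯度
    return grad.norm(p=2, dim=-1).mean().item()
```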

    表  5  实验中初始状态的随机扰动设置

    Table  5  Random noise settings for the initial state

    环境 数据集随机扰动分布 测试随机扰动分布
    halfcheetah [−0.1, 0.1] $ [-0.20,\;-0.15] \cup [0.15,\;0.20] $
    walker2d [−0.005, 0.005] $ [-0.025,\;-0.01] \cup [0.01,\;0.025] $
    hopper [−0.005, 0.005] $ [-0.025,\;-0.01] \cup [0.01,\;0.025] $
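表5中的测试扰动取自两个对称区间的并集, 可按如下方式采样 (示意代码, 函数名与接口为本文假设):

```python
import numpy as np

def sample_test_noise(low, high, size, rng):
    """从 [-high, -low] ∪ [low, high] 上均匀采样扰动."""
    magnitude = rng.uniform(low, high, size)    # 扰动幅值
    sign = rng.choice([-1.0, 1.0], size=size)   # 随机符号, 覆盖两个对称区间
    return sign * magnitude

# 例如 walker2d/hopper 的测试设置:
# noise = sample_test_noise(0.01, 0.025, size=11, rng=np.random.default_rng(0))
```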

    表  6  泛化能力测试时的平均标准化奖励

    Table  6  Average normalized rewards in generalization testing

    环境 TD3+BC GLO
    halfcheetah-m 47.0 56.4
    halfcheetah-m-r 43.7 47.5
    halfcheetah-m-e 89.9 93.8
    walker2d-m 81.7 88.4
    walker2d-m-r 77.9 87.6
    walker2d-m-e 110.0 111.5
    hopper-m 55.2 94.4
    hopper-m-r 53.2 77.7
    hopper-m-e 98.4 105.0
    平均 73.0 84.7

    表  7  消融实验结果

    Table  7  Ablation experiment results

    环境 TD3+BC 原损失函数(实验1) 无Q norm(实验2) 无反向软更新(实验3) GLO
    halfcheetah-m 48.3±0.3 49.5±0.2 57.5±1.2 58.0±0.3 57.1±0.4
    halfcheetah-m-r 44.6±0.5 45.4±0.2 49.0±0.9 47.9±0.4 48.0±0.3
    halfcheetah-m-e 90.7±4.3 96.3±0.5 87.7±4.5 96.1±0.8 96.2±0.9
    walker2d-m 83.7±2.1 84.6±1.4 89.9±2.3 90.0±1.3 90.2±1.3
    walker2d-m-r 81.8±5.5 66.8±30.7 85.6±2.6 89.6±1.3 89.6±1.4
    walker2d-m-e 110.1±0.5 109.7±0.2 92.2±46.9 111.2±0.7 111.8±1.2
    hopper-m 59.3±4.2 71.0±5.1 77.9±15.3 93.3±2.3 94.1±1.7
    hopper-m-r 60.9±18.8 81.2±10.8 76.1±18.4 99.2±1.6 99.2±1.6
    hopper-m-e 98.4±9.4 110.1±3.1 106.6±8.4 111.8±1.2 111.7±1.4
    平均 75.3 79.4 80.3 88.5 88.7

    表  8  不同数据集下拐点的$ \lambda$值

    Table  8  The $\lambda$ corresponding to the inflection point in different datasets

    环境 m m-r m-e
    halfcheetah 0.10 0.05 0.01
    walker2d 0.10 0.03 0.03
    hopper 0.05 0.02 0.01

    表  9  在线强化学习实验中设置的超参数

    Table  9  Hyperparameters in online reinforcement learning experiments

    环境 $ \alpha$
    ant 100
    halfcheetah 10
    hopper 10
    invertedPendulum 1
    reacher 100
    invertedDoublePendulum 0.1
    walker 0.1
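若将表9中的 $\alpha$ 理解为在线设置下梯度正则项的权重 (此处为本文的推测, 以论文原文为准), 表10中"TD3+正则"的策略损失可写成如下示意:

```python
import torch

def online_actor_loss(actor, critic, states, alpha):
    """TD3 策略损失 + 梯度范数正则 (示意, alpha 的含义为假设)."""
    actions = actor(states)
    base_loss = -critic(states, actions).mean()
    grad = torch.autograd.grad(base_loss, actions, create_graph=True)[0]
    return base_loss + alpha * grad.norm(p=2, dim=-1).mean()
```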

    表  10  在线强化学习中梯度损失函数的应用实验结果

    Table  10  Experimental results of GLF in online reinforcement learning

    环境 TD3 TD3+正则
    ant 5094.5±1116.1 5340.0±343.8
    halfcheetah 9638.3±998.0 10577.1±660.0
    hopper 2427.3±1180.1 3371.6±142.0
    invertedPendulum 1000 1000
    reacher −3.888±0.446 −3.896±0.433
    invertedDoublePendulum 5765.2±4863.0 7480.4±4166.7
    walker 3141.8±1142.6 4378.4±707.9

    表  11  SPOT中梯度损失函数的应用实验结果

    Table  11  Experimental results of GLF in SPOT

    环境 SPOT SPOT+正则
    halfcheetah-m 58.4±1.0 60.1±1.2
    halfcheetah-m-r 52.2±1.2 53.3±0.4
    halfcheetah-m-e 86.9±4.3 92.8±0.6
    walker2d-m 86.4±2.7 88.4±1.1
    walker2d-m-r 91.6±2.8 92.7±2.4
    walker2d-m-e 112.1±0.5 110.6±2.0
    hopper-m 86.0±8.7 96.3±5.3
    hopper-m-r 100.2±1.9 101.4±0.9
    hopper-m-e 99.3±7.1 103.8±2.7
    平均 85.9 88.8
  • [1] 刘全, 翟建伟, 章宗长, 钟珊, 周倩, 章鹏, 等. 深度强化学习综述. 计算机学报, 2018, 41(1): 1−27 doi: 10.11897/SP.J.1016.2018.00001

    Liu Quan, Zhai Jian-Wei, Zhang Zong-Chang, Zhong Shan, Zhou Qian, Zhang Peng, et al. A survey on deep reinforcement learning. Chinese Journal of Computers, 2018, 41(1): 1−27 doi: 10.11897/SP.J.1016.2018.00001
    [2] 丁世飞, 杜威, 张健, 郭丽丽, 丁玲. 多智能体深度强化学习研究进展. 计算机学报, 2024, 47(7): 1547−1567

    Ding Shi-Fei, Du Wei, Zhang Jian, Guo Li-Li, Ding Ling. Research progress of multi-agent deep reinforcement learning. Chinese Journal of Computers, 2024, 47(7): 1547−1567
    [3] 吴晓光, 刘绍维, 杨磊, 邓文强, 贾哲恒. 基于深度强化学习的双足机器人斜坡步态控制方法. 自动化学报, 2021, 47(8): 1974−1987

    Wu Xiao-Guang, Liu Shao-Wei, Yang Lei, Deng Wen-Qiang, Jia Zhe-Heng. A gait control method for biped robot on slope based on deep reinforcement learning. Acta Automatica Sinica, 2021, 47(8): 1974−1987
    [4] 张威振, 何真, 汤张帆. 风扰下无人机栖落机动的强化学习控制设计. 上海交通大学学报, 2024, 58(11): 1753−1761

    Zhang Wei-Zhen, He Zhen, Tang Zhang-Fan. Reinforcement learning control design for perching maneuver of unmanned aerial vehicles with wind disturbances. Journal of Shanghai Jiao Tong University, 2024, 58(11): 1753−1761
    [5] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, et al. Mastering the game of Go without human knowledge. Nature, 2017, 550(7676): 354−359 doi: 10.1038/nature24270
    [6] Vinyals O, Babuschkin I, Czarnecki W M, Mathieu M, Dudzik A, Chung J, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019, 575(7782): 350−354 doi: 10.1038/s41586-019-1724-z
    [7] 余宏晖, 林声宏, 朱建全, 陈浩悟. 基于深度强化学习的微电网在线优化. 电测与仪表, 2024, 61(4): 9−14

    Yu Hong-Hui, Lin Sheng-Hong, Zhu Jian-Quan, Chen Hao-Wu. On-line optimization of micro-grid based on deep reinforcement learning. Electric Measurement and Instrumentation, 2024, 61(4): 9−14
    [8] 张沛, 陈玉鑫, 王光华, 李晓影. 基于图强化学习的配电网故障恢复决策. 电力系统自动化, 2024, 48(2): 151−158 doi: 10.7500/AEPS20230316001

    Zhang Pei, Chen Yu-Xin, Wang Guang-Hua, Li Xiao-Ying. Fault recovery decision of distribution network based on graph reinforcement learning. Automation of Electric Power Systems, 2024, 48(2): 151−158 doi: 10.7500/AEPS20230316001
    [9] 刘健, 顾扬, 程玉虎, 王雪松. 基于多智能体强化学习的乳腺癌致病基因预测. 自动化学报, 2022, 48(5): 1246−1258

    Liu Jian, Gu Yang, Cheng Yu-Hu, Wang Xue-Song. Prediction of breast cancer pathogenic genes based on multi-agent reinforcement learning. Acta Automatica Sinica, 2022, 48(5): 1246−1258
    [10] Choudhary K, Gupta D, Thomas P S. ICU-Sepsis: a benchmark MDP built from real medical data. arXiv preprint arXiv: 2406.05646, 2024.
    [11] 张兴龙, 陆阳, 李文璋, 徐昕. 基于滚动时域强化学习的智能车辆侧向控制算法. 自动化学报, 2023, 49(12): 2481−2492

    Zhang Xing-Long, Lu Yang, Li Wen-Zhang, Xu Xin. Receding horizon reinforcement learning algorithm for lateral control of intelligent vehicles. Acta Automatica Sinica, 2023, 49(12): 2481−2492
    [12] 何逸煦, 林泓熠, 刘洋, 杨澜, 曲小波. 强化学习在自动驾驶技术中的应用与挑战. 同济大学学报(自然科学版), 2024, 52(4): 520−531 doi: 10.11908/j.issn.0253-374x.23397

    He Yi-Xu, Lin Hong-Yi, Liu Yang, Yang Lan, Qu Xiao-Bo. Application and challenges of reinforcement learning in autonomous driving technology. Journal of Tongji University (Natural Science), 2024, 52(4): 520−531 doi: 10.11908/j.issn.0253-374x.23397
    [13] Vasco M, Seno T, Kawamoto K, Subramanian K, Wurman P R, Stone P. A super-human vision-based reinforcement learning agent for autonomous racing in Gran Turismo. arXiv preprint arXiv: 2406.12563, 2024.
    [14] 刘扬, 何泽众, 王春宇, 郭茂祖. 基于DDPG算法的末制导律设计研究. 计算机学报, 2021, 44(9): 1854−1865 doi: 10.11897/SP.J.1016.2021.01854

    Liu Yang, He Ze-Zhong, Wang Chun-Yu, Guo Mao-Zu. Terminal guidance law design based on DDPG algorithm. Chinese Journal of Computers, 2021, 44(9): 1854−1865 doi: 10.11897/SP.J.1016.2021.01854
    [15] Levine S, Kumar A, Tucker G, Fu J. Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv: 2005.01643, 2020.
    [16] 王雪松, 王荣荣, 程玉虎. 安全强化学习综述. 自动化学报, 2023, 49(9): 1813−1835

    Wang Xue-Song, Wang Rong-Rong, Cheng Yu-Hu. Safe reinforcement learning: a survey. Acta Automatica Sinica, 2023, 49(9): 1813−1835
    [17] Fujimoto S, Gu S S. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 2021, 34: 20132−20145
    [18] Fujimoto S, Hoof H, Meger D. Addressing function approximation error in actor-critic methods. In: Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR, 2018. 1587−1596
    [19] Torabi F, Warnell G, Stone P. Behavioral cloning from observation. arXiv preprint arXiv: 1805.01954, 2018.
    [20] Wu J, Wu H, Qiu Z, Wang J, Long M. Supported policy optimization for offline reinforcement learning. Advances in Neural Information Processing Systems, 2022, 35: 31278−31291
    [21] Kingma D P, Welling M. Auto-encoding variational bayes. arXiv preprint arXiv: 1312.6114, 2013.
    [22] Wang Z, Hunt J J, Zhou M. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv: 2208.06193, 2022.
    [23] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020, 33: 6840−6851
    [24] Tarasov D, Nikulin A, Akimov D, Kurenkov V, Kolesnikov S. CORL: Research-oriented deep offline reinforcement learning library. Advances in Neural Information Processing Systems, 2023, 36: 30997−31020
    [25] Hochreiter S, Schmidhuber J. Flat minima. Neural Computation, 1997, 9(1): 1−42 doi: 10.1162/neco.1997.9.1.1
    [26] Zhao Y, Zhang H, Hu X. Penalizing gradient norm for efficiently improving generalization in deep learning. In: Proceedings of the 39th International Conference on Machine Learning. Baltimore, Maryland, USA: PMLR, 2022. 26982−26992
    [27] Gao C, Xu K, Liu L, Ye D, Zhao P, Xu Z. Robust offline reinforcement learning with gradient penalty and constraint relaxation. arXiv preprint arXiv: 2210.10469, 2022.
    [28] Fu J, Kumar A, Nachum O, Tucker G, Levine S. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv: 2004.07219, 2020.
    [29] Sutton R S, Barto A G. Reinforcement Learning: An Introduction (2nd edition). Cambridge: MIT Press, 2018
    [30] Sutton R S, McAllester D, Singh S, Mansour Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 1999, 12
    [31] Kumar A, Zhou A, Tucker G, Levine S. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 2020, 33: 1179−1191
    [32] Kostrikov I, Nair A, Levine S. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv: 2110.06169, 2021.
    [33] Peng Z, Han C, Liu Y, Zhou Z. Weighted policy constraints for offline reinforcement learning. In: Proceedings of the 37th AAAI Conference on Artificial Intelligence. Washington, USA: AAAI Press, 2023. 9435−9443
出版历程
  • 收稿日期:  2024-07-05
  • 录用日期:  2024-12-23
  • 网络出版日期:  2025-04-20
