2.624

2020影响因子

(CJCR)

  • 中文核心
  • EI
  • 中国科技核心
  • Scopus
  • CSCD
  • 英国科学文摘

留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

一种加速时间差分算法收敛的方法

何斌 刘全 张琳琳 时圣苗 陈红名 闫岩

何斌,  刘全,  张琳琳,  时圣苗,  陈红名,  闫岩.  一种加速时间差分算法收敛的方法.  自动化学报,  2021,  47(7): 1679−1688 doi: 10.16383/j.aas.c190140
引用本文: 何斌,  刘全,  张琳琳,  时圣苗,  陈红名,  闫岩.  一种加速时间差分算法收敛的方法.  自动化学报,  2021,  47(7): 1679−1688 doi: 10.16383/j.aas.c190140
He Bin,  Liu Quan,  Zhang Lin-Lin,  Shi Sheng-Miao,  Chen Hong-Ming,  Yan Yan.  A method of accelerating the convergence of temporal difference learning.  Acta Automatica Sinica,  2021,  47(7): 1679−1688 doi: 10.16383/j.aas.c190140
Citation: He Bin,  Liu Quan,  Zhang Lin-Lin,  Shi Sheng-Miao,  Chen Hong-Ming,  Yan Yan.  A method of accelerating the convergence of temporal difference learning.  Acta Automatica Sinica,  2021,  47(7): 1679−1688 doi: 10.16383/j.aas.c190140

一种加速时间差分算法收敛的方法

doi: 10.16383/j.aas.c190140
基金项目: 国家自然科学基金项目(61772355, 61702055, 61502323, 61502329),> 江苏省高等学校自然科学研究重大项目(18KJA520011, 17KJA520004),吉林大学符号计算与知识工程教育部重点实验室资助项目(93K172014K04, 93K172017K18),苏州市应用基础研究计划工业部分(SYG201422)资助
详细信息
    作者简介:

    何斌:苏州大学计算机科学与技术学院硕士研究生. 主要研究方向为强化学习. E-mail: hebinyc@gmail.com

    刘全:苏州大学计算机科学与技术学院教授. 2004年获得吉林大学博士学位. 主要研究方向为强化学习,深度学习,深度强化学习. 本文通信作者. E-mail: quanliu@suda.edu.cn

    张琳琳:苏州大学计算机科学与技术学院硕士研究生. 主要研究方向为强化学习. E-mail: 20175227007@stu.suda.edu.cn

    时圣苗:苏州大学计算机科学与技术学院硕士研究生. 主要研究方向为深度强化学习. E-mail: 20175227045@stu.suda.edu.cn

    陈红名:苏州大学计算机科学与技术学院硕士研究生. 主要研究方向为深度强化学习. E-mail: 20174227007@stu.suda.edu.cn

    闫岩:苏州大学计算机科学与技术学院硕士研究生. 主要研究方向为深度强化学习. E-mail: 20165222006@stu.suda.edu.cn

  • 收稿日期 2019-03-07 录用日期 2019-08-22 Manuscript received March 7, 2019; accepted August 22, 2019 国家自然科学基金项目 (61772355, 61702055, 61502323, 61502329), 江苏省高等学校自然科学研究重大项目 (18KJA520011, 17KJA520004),吉林大学符号计算与知识工程教育部重点实验室资助项目 (93K172014K04, 93K172017K18),苏州市应用基础研究计划工业部分 (SYG201422) 资助 Supported by National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329), Jiangsu Province Natural Science Research University Major Projects (18KJA520011, 17KJA520004), Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18), Suzhou Industrial Application of Basic Research Program (SYG201422) 本文责任编委 魏庆来 Recommended by Associate Editor WEI Qing-Lai 1. 苏州大学计算机科学与技术学院 苏州 215006 2. 苏州大学江苏省计算机信息处理技术重点实验室 苏州 215006 3. 吉林大学符号计算与知识工程教育部重点实验室 长春 130012 4. 软件新技术与产业化协同创新中心 南京 210000 1. School of Computer Science and Technology, Soochow University, Suzhou 215006    2. Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006    3. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012    4. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000

A Method of Accelerating the Convergence of Temporal Difference Learning

Funds: Supported by National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329), Jiangsu Province Natural Science Research University Major Projects (18KJA520011, 17KJA520004), Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18), Suzhou Industrial Application of Basic Research Program (SYG201422)
More Information
    Author Bio:

    HE Bin Master student at the School of Computer Science and Technology, Soochow University. His main research interest is reinforcement learning

    LIU Quan Professor at the School of Computer Science and Technology, Soochow University. He received his Ph. D. degree from Jilin University in 2004. His research interest covers reinforcement learning, deep learning and deep reinforcement learning. Corresponding author of this paper

    ZHANG Lin-Lin Master student at the School of Computer Science and Technology, Soochow University. Her main research interest is reinforcement learning

    SHI Sheng-Miao Master student at the School of Computer Science and Technology, Soochow University. His main research interest is deep reinforcement learning

    CHEN Hong-Ming Master student at the School of Computer Science and Technology, Soochow University. His main research interest is deep reinforcement learning

    YAN Yan Master student at the School of Computer Science and Technology, Soochow University. His main research interest is deep reinforcement learning

  • 摘要:

    时间差分算法(Temporal difference methods, TD)是一类模型无关的强化学习算法. 该算法拥有较低的方差和可以在线(On-line)学习的优点, 得到了广泛的应用. 但对于一种给定的TD算法, 往往只能通过调整步长参数或其他超参数来加速收敛, 这也就造成了加速TD算法收敛的方法匮乏. 针对此问题提出了一种利用蒙特卡洛算法(Monte Carlo methods, MC)来加速TD算法收敛的方法(Accelerate TD by MC, ATDMC). 该方法不仅可以适用于绝大部分的TD算法, 而且不需要改变在线学习的方式. 为了证明方法的有效性, 分别在同策略(On-policy)评估、异策略(Off-policy)评估和控制(Control)三个方面进行了实验. 实验结果表明ATDMC方法可以有效地加速各类TD算法.

    1)  收稿日期 2019-03-07 录用日期 2019-08-22 Manuscript received March 7, 2019; accepted August 22, 2019 国家自然科学基金项目 (61772355, 61702055, 61502323, 61502329), 江苏省高等学校自然科学研究重大项目 (18KJA520011, 17KJA520004),吉林大学符号计算与知识工程教育部重点实验室资助项目 (93K172014K04, 93K172017K18),苏州市应用基础研究计划工业部分 (SYG201422) 资助 Supported by National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329), Jiangsu Province Natural Science Research University Major Projects (18KJA520011, 17KJA520004), Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18), Suzhou Industrial Application of Basic Research Program (SYG201422) 本文责任编委 魏庆来 Recommended by Associate Editor WEI Qing-Lai 1. 苏州大学计算机科学与技术学院 苏州 215006 2. 苏州大学江苏省计算机信息处理技术重点实验室 苏州 215006 3. 吉林大学符号计算与知识工程教育部重点实验室 长春 130012 4. 软件新技术与产业化协同创新中心 南京 210000 1. School of Computer Science and Technology, Soochow University, Suzhou 215006    2. Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006    3. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012    4. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000
  • 图  1  $U_{TD}$构成的不稳定序列

    Fig.  1  A non-stationary series

    图  2  改进后的GPI

    Fig.  2  The improved GPI

    图  3  博扬链

    Fig.  3  Boyan chain

    图  4  同策略评估

    Fig.  4  On-policy estimation

    图  5  11状态的随机漫步

    Fig.  5  11-State random walk

    图  6  异策略评估

    Fig.  6  Off-policy estimation

    图  7  山地车问题 $(\beta$固定)

    Fig.  7  Mountain car (fixed $\beta)$

    图  8  山地车问题$(\alpha$固定)

    Fig.  8  Mountain car (fixed $\alpha)$

    图  9  平衡杆问题$(\beta$固定)

    Fig.  9  Cart pole (fixed $\beta)$

    图  10  平衡杆问题$(\alpha $固定)

    Fig.  10  Cart pole (fixed $\alpha) $

  • [1] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2018.
    [2] 刘乃军, 鲁涛, 蔡莹皓, 王硕. 机器人操作技能学习方法综述. 自动化学报, 2019, 45(3): 458−470

    Liu Nai-Jun, Lu Tao, Cai Ying-Hao, Wang Shuo. A review of robot manipulation skills learning methods. Acta Automatica Sinica, 2019, 45(3): 458−470
    [3] Polydoros A S, Nalpantidis L. Survey of model-based reinforcement learning: applications on robotics. Journal of Intelligent & Robotic Systems, 2017, 86(2): 153−173
    [4] 王鼎. 基于学习的鲁棒自适应评判控制研究进展. 自动化学报, 2019, 45(6): 1031−1043

    Wang Ding. Research progress on learning-based robust adaptive critic control. Acta Automatica Sinica, 2019, 45(6): 1031−1043
    [5] Wang Fei-Yue, Zheng Nan-Ning, Cao Dong-Pu, Clara M M, Li Li, Liu Teng. Parallel driving in CPSS: A unified approach for transport automation and vehicle intelligence. IEEE/CAA Journal of Automatica Sinica, 2017, 4(4): 577−587 doi: 10.1109/JAS.2017.7510598
    [6] Du S S, Chen Jian-Shu, Li Li-Hong, Xiao Lin, Zhou Deng-Yong. Stochastic variance reduction methods for policy evaluation. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. Sydney, NSW, Australia: PMLR, 2017. 1049–1058
    [7] Lyu Dao-Ming, Liu Bo, Geist M, Dong Wen, Biaz S, Wang Qi. Stable and efficient policy evaluation. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(6): 1831−1840 doi: 10.1109/TNNLS.2018.2871361
    [8] Dabney W C. Adaptive step-sizes for reinforcement learning [Ph. D. dissertation], University of Massachusetts Amherst, 2014
    [9] Young K, Wang Bao-Xiang, Taylor M E. Metatrace actor-critic: Online step-size tuning by meta-gradient descent for reinforcement learning control. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. Macau, China: ijcai.org, 2019. 4185–4191
    [10] Szepesvári C. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2010, 4(1): 1−103
    [11] Busoniu L, Babuska R, De Schutter B, Ernst D. Reinforcement Learning and Dynamic Programming Using Function Approximators. Florida: CRC Press, 2017.
    [12] Boyan J A. Least-squares temporal difference learning. In: Proceedings of the 16th International Conference on Machine Learning. Bled, Slovenia: Morgan Kaufmann, 1999. 49–56
    [13] Sutton R S, Szepesvári C, Maei H R. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems, 2008, 21(21): 1609−1616
    [14] Sutton R S, Maei H R, Precup D, Bhatnagar S, Silver D, Szepesvári C, et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of the 26th International Conference on Machine Learning. Montreal, Quebec, Canada: ACM, 2009. 993–1000
    [15] Dann C, Neumann G, Peters J. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 2014, 15(1): 809−883
    [16] Van Seijen H, Sutton R S. True online TD (lambda). In: Proceedings of the 31st International Conference on Machine Learning. Beijing, China: JMLR.org, 2014. 692–700
    [17] Precup D, Sutton R S, Dasgupta S. Off-policy temporal-difference learning with function approximation. In: Proceedings of the 18th International Conference on Machine Learning. Williams College, Williamstown, MA, USA: Morgan Kaufmann, 2001. 417–424
    [18] Maei H R, Sutton R S. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In: Proceedings of the 3rd Conference on Artficial General Intelligence. Lugano, Switzerland: Atlantis Press, 2010. 91–96
    [19] Precup D, Sutton R S, Singh S. Eligibility traces for off-policy policy evaluation. In: Proceedings of the Seventeenth International Conference on Machine Learning. Stanford, CA, USA: Morgan Kaufmann, 2000. 759–766
    [20] Sutton R S. Learning to predict by the method of temporal differences. Machine Learning, 1988, 3(1): 9−44
    [21] Van Seijen H, Mahmood A R, Pilarski P M, Machado M C, Sutton R S. True online temporal-difference learning. The Journal of Machine Learning Research, 2016, 17(1): 5057−5096
    [22] Maei H R. Gradient temporal-difference learning algorithms [Ph. D. dissertation], University of Alberta, 2014
    [23] White A, White M. Investigating practical linear temporal difference learning. In: Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems. Singapore, Singapore: ACM, 2016. 494–502
    [24] Sutton R S, Mahmood A R, White M. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 2016, 17(1): 2603−2631
    [25] Yu H. Weak convergence properties of constrained emphatic temporal-difference learning with constant and slowly diminishing stepsize. The Journal of Machine Learning Research, 2016, 17(1): 7745−7802
    [26] Rummery G A. Problem solving with reinforcement learning [Ph. D. dissertation], University of Cambridge, 1995
  • 加载中
图(10)
计量
  • 文章访问数:  310
  • HTML全文浏览量:  215
  • PDF下载量:  100
  • 被引次数: 0
出版历程
  • 收稿日期:  2019-03-07
  • 录用日期:  2019-08-22
  • 网络出版日期:  2021-06-30
  • 刊出日期:  2021-07-20

目录

    /

    返回文章
    返回