Deep Deterministic Policy Gradient With Classified Experience Replay

Shi Sheng-Miao, Liu Quan

Citation: Shi Sheng-Miao, Liu Quan. Deep deterministic policy gradient with classified experience replay. Acta Automatica Sinica, 2022, 48(7): 1816−1823. doi: 10.16383/j.aas.c190406

doi: 10.16383/j.aas.c190406

Funds: Supported by the National Natural Science Foundation of China (61772355, 61702055, 61876217, 62176175) and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD)

Author Bios:

    SHI Sheng-Miao  Master student at the School of Computer Science and Technology, Soochow University. His main research interest is deep reinforcement learning. E-mail: 20175227045@stu.suda.edu.cn

    LIU Quan  Professor at Soochow University. His research interest covers deep reinforcement learning and automated reasoning. Corresponding author of this paper. E-mail: quanliu@suda.edu.cn

Abstract: The deep deterministic policy gradient (DDPG) method achieves good performance on continuous control tasks. To further improve the efficiency of the experience replay mechanism in DDPG, a classified experience replay method is proposed, with two ways of classifying experience samples: DDPG with temporal difference-error classification (TDC-DDPG) and DDPG with reward classification (RC-DDPG). Both TDC-DDPG and RC-DDPG maintain two experience replay buffers and store generated experience samples separately according to their importance; during network training, selecting a larger share of high-importance samples speeds up learning. The classified experience replay method is evaluated on continuous control tasks, and the experimental results show that TDC-DDPG and RC-DDPG outperform the DDPG method that selects experience samples uniformly at random.
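The core mechanism described in the abstract, classifying each transition by an importance score (|TD-error| for TDC-DDPG, immediate reward for RC-DDPG) into one of two buffers and sampling training batches mostly from the high-importance buffer, can be illustrated with a minimal sketch. The class name, the fixed threshold, and the 70/30 sampling split below are illustrative assumptions, not details taken from the paper:

```python
import random
from collections import deque

class ClassifiedReplayBuffer:
    """Two-pool replay buffer: transitions whose importance score (|TD-error|
    in TDC-DDPG, immediate reward in RC-DDPG) reaches a threshold go to the
    high-importance pool, the rest to the low-importance pool. Batches are
    drawn mostly from the high-importance pool to speed up learning."""

    def __init__(self, capacity=100_000, threshold=1.0, high_fraction=0.7):
        self.high = deque(maxlen=capacity)   # high-importance transitions
        self.low = deque(maxlen=capacity)    # low-importance transitions
        self.threshold = threshold           # assumed fixed classification threshold
        self.high_fraction = high_fraction   # assumed share of each batch from the high pool

    def store(self, transition, importance):
        """transition: (state, action, reward, next_state, done) tuple;
        importance: |TD-error| or immediate reward, depending on the variant."""
        pool = self.high if importance >= self.threshold else self.low
        pool.append(transition)

    def sample(self, batch_size):
        """Draw a batch biased toward high-importance samples; the batch may be
        smaller than batch_size while the pools are still filling up."""
        n_high = min(int(batch_size * self.high_fraction), len(self.high))
        n_low = min(batch_size - n_high, len(self.low))
        return random.sample(list(self.high), n_high) + random.sample(list(self.low), n_low)

if __name__ == "__main__":
    buffer = ClassifiedReplayBuffer(threshold=0.5, high_fraction=0.7)
    for i in range(1000):                        # fake transitions for illustration
        transition = (f"s{i}", f"a{i}", random.random(), f"s{i + 1}", False)
        buffer.store(transition, importance=random.random())
    print(len(buffer.sample(64)), "transitions sampled")
```

A DDPG agent would call store() after each environment step and sample() before each gradient update; how the threshold is set, and whether it adapts during training, is a design choice this sketch leaves open.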
Fig. 1  CER-DDPG algorithm structure diagram

Fig. 2  Comparison of experimental results

Fig. 3  Experimental comparison of CER-DDPG with the latest policy gradient algorithms

Table 1  Experimental data in continuous action tasks

Task         Algorithm    Average reward   Maximum reward   Standard deviation
HalfCheetah  DDPG                3360.32          5335.23              1246.40
             TDC-DDPG            5349.64          9220.27              2368.13
             RC-DDPG             3979.64          6553.49              1580.21
Ant          DDPG                 551.87          1908.30               307.86
             TDC-DDPG             521.42          1863.99               296.91
             RC-DDPG              772.37          2971.63               460.05
Humanoid     DDPG                 404.36           822.11               114.38
             TDC-DDPG             462.65           858.34               108.20
             RC-DDPG              440.30           835.75               100.31
Walker       DDPG                 506.10          1416.00               243.02
             TDC-DDPG             521.58          1919.15               252.95
             RC-DDPG              700.57          3292.62               484.65
Hopper       DDPG                 422.10          1224.68               180.04
             TDC-DDPG             432.64          1689.48               223.61
             RC-DDPG              513.45          2050.72               257.82
Swimmer      DDPG                  34.06            63.16                16.74
             TDC-DDPG              44.18            69.40                19.77
             RC-DDPG               38.44            71.70                21.59
Publication History
  • Received: 2019-05-24
  • Accepted: 2019-09-24
  • Available online: 2022-06-10
  • Issue published: 2022-07-01
