
Expectation-maximization Policy Search with Parameter-based Exploration
(基于参数探索的期望最大化策略搜索)

CHENG Yu-Hu, FENG Huan-Ting, WANG Xue-Song

Citation: CHENG Yu-Hu, FENG Huan-Ting, WANG Xue-Song. Expectation-maximization Policy Search with Parameter-based Exploration. ACTA AUTOMATICA SINICA, 2012, 38(1): 38-45. doi: 10.3724/SP.J.1004.2012.00038

doi: 10.3724/SP.J.1004.2012.00038

Corresponding author: CHENG Yu-Hu, Professor at the China University of Mining and Technology. His main research interests include machine learning and intelligent optimization and control. E-mail: chengyuhu@163.com

Abstract: To address the problem that random exploration in the action space easily leads to excessive variance in gradient estimates, an expectation-maximization (EM) policy search method with parameter-based exploration is proposed. First, the policy is defined as a probability distribution over the controller parameters. Then, samples are collected by sampling repeatedly and directly in the controller parameter space according to this distribution. Because the actions chosen while collecting each episode are all deterministic, the variance introduced by sampling is reduced, and hence so is the variance of the gradient estimate. Finally, based on the collected samples, the policy parameters are updated iteratively by maximizing a lower bound on the expected return. To shorten sampling time and reduce sampling cost, importance sampling is used so that samples collected during earlier policy updates can be reused. Simulation results on two continuous-space control problems show that, compared with policy search reinforcement learning methods based on random action exploration, the proposed method not only learns the best policy but also converges faster, demonstrating good learning performance.
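The abstract describes the procedure only in prose. As a rough illustration, the Python sketch below shows one way the idea can be realized under simple assumptions: the policy is a Gaussian distribution N(mu, diag(sigma^2)) over controller parameters, returns are turned into non-negative weights with an exponential scaling beta, and importance weights allow samples from earlier updates to be reused. The function run_episode, the parameter dim, and the scaling beta are hypothetical placeholders for the user's environment and controller; this is a minimal sketch of a generic parameter-based-exploration EM scheme, not the authors' exact algorithm.

import numpy as np

def em_policy_search(run_episode, dim, n_iters=50, n_samples=20, beta=1.0, seed=0):
    """Sketch of EM policy search with parameter-based exploration.

    run_episode(theta) is assumed to execute one deterministic rollout with
    controller parameters theta and return its total reward. run_episode, dim,
    beta and the Gaussian hyper-policy are illustrative assumptions, not the
    paper's exact formulation.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)    # policy: N(mu, diag(sigma^2)) over theta
    thetas, returns, behav = [], [], []        # stored samples and their sampling distributions

    def log_gauss(x, m, s):
        # log N(x; m, diag(s^2)) up to an additive constant (it cancels in ratios)
        return -0.5 * np.sum(((x - m) / s) ** 2 + 2.0 * np.log(s))

    for _ in range(n_iters):
        # Exploration happens in parameter space: one theta is drawn per episode,
        # and the rollout itself uses deterministic actions, so no per-step action
        # noise inflates the variance of the estimate.
        for _ in range(n_samples):
            theta = mu + sigma * rng.standard_normal(dim)
            thetas.append(theta)
            returns.append(run_episode(theta))
            behav.append((mu.copy(), sigma.copy()))

        R = np.asarray(returns)
        # Returns enter the EM lower bound as (improper) probabilities, so they
        # must be non-negative; an exponential transformation is one common choice.
        u = np.exp(beta * (R - R.max()))
        # Importance weights let samples drawn under earlier hyper-policies be
        # reused when re-estimating the current one.
        iw = np.array([np.exp(log_gauss(t, mu, sigma) - log_gauss(t, m0, s0))
                       for t, (m0, s0) in zip(thetas, behav)])
        w = u * iw
        w /= w.sum()

        # M-step: reward-weighted re-estimation of the parameter distribution,
        # which maximizes the lower bound on the expected return.
        T = np.asarray(thetas)
        mu = w @ T
        sigma = np.sqrt(w @ (T - mu) ** 2 + 1e-8)

    return mu, sigma

Because each episode fixes theta once and then acts deterministically, the only stochasticity per rollout is the single draw in parameter space, which is exactly the variance-reduction argument made in the abstract.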
Publication history
  • Received: 2011-05-24
  • Revised: 2011-08-30
  • Published: 2012-01-20
