A Hybrid Transfer Algorithm for Reinforcement Learning Based on Spectral Method
Abstract: In transfer tasks where the state space is scaled up proportionally, the proto-value function method can effectively transfer only the basis functions corresponding to the smaller eigenvalues, so the approximated value function of the target task is wrong for some states. To address this problem, exploiting the fact that the Laplacian eigenmap preserves the local topology of the state space, an improved hierarchical decomposition technique based on spectral graph theory is proposed, together with a hybrid transfer method that combines basis-function transfer with the transfer of optimal subtask policies. First, the basis functions of the source task are computed with a spectral method and extended to basis functions of the target task by linear interpolation. Then, the interpolated secondary basis function (an approximation of the target task's Fiedler eigenvector) is used to decompose the target task, and the optimal policies of the resulting subtasks are obtained with the improved hierarchical decomposition technique. Finally, the extended basis functions and the obtained subtask policies are used together in learning the target task. The proposed hybrid transfer method can directly determine the optimal policy on part of the target task's state space, reduces the minimum number of basis functions needed for value-function approximation, and lowers the number of policy iterations; it is suitable for transfer tasks whose state space is scaled up proportionally and which have a hierarchical structure. Simulation results on grid worlds verify the effectiveness of the new method.
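The full paper is not reproduced here, but the two spectral steps the abstract outlines can be illustrated concretely. The Python sketch below is illustrative only, not the authors' implementation: it uses a 1-D chain graph as a stand-in for the grid world, and the sizes (n_src, n_tgt), the number of basis functions k, and all helper names are assumptions made for this toy example. It builds proto-value functions as the smallest-eigenvalue eigenvectors of the normalized graph Laplacian of the source task, scales them to a larger target task by linear interpolation, and uses the sign of the interpolated second basis function (the approximate Fiedler eigenvector) to split the target state space into subtasks.

```python
# Minimal sketch of the spectral transfer steps described in the abstract.
# Assumed toy setting: a 1-D chain "grid world"; sizes and names are
# hypothetical, not taken from the paper.
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    d = A.sum(axis=1)
    s = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)   # diagonal of D^{-1/2}
    return np.eye(len(A)) - s[:, None] * A * s[None, :]

def proto_value_functions(A, k):
    """The k eigenvectors of L with the smallest eigenvalues (the PVFs)."""
    _, eigvecs = np.linalg.eigh(normalized_laplacian(A))  # ascending order
    return eigvecs[:, :k]                                 # columns = basis

def chain_adjacency(n):
    """Adjacency matrix of an n-state chain, a stand-in for a grid world."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = A[idx + 1, idx] = 1.0
    return A

n_src, n_tgt, k = 20, 60, 5            # target task scaled up 3x (assumed)
phi_src = proto_value_functions(chain_adjacency(n_src), k)

# Linear interpolation: resample each source basis function at the
# normalized positions of the target states.
x_src = np.linspace(0.0, 1.0, n_src)
x_tgt = np.linspace(0.0, 1.0, n_tgt)
phi_tgt = np.column_stack(
    [np.interp(x_tgt, x_src, phi_src[:, j]) for j in range(k)]
)

# The interpolated *second* basis function approximates the Fiedler
# eigenvector of the target task; its sign gives a two-way spectral
# partition of the state space into candidate subtasks.
fiedler_approx = phi_tgt[:, 1]
subtasks = (fiedler_approx >= 0).astype(int)   # subtask label per state
```

In the method the abstract describes, the partition produced this way would then be handed to the improved hierarchical decomposition step, which solves each subtask for an optimal policy; those policies and the interpolated basis functions phi_tgt are what get transferred to the target task.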