Learning Control of Dynamical Systems Based on Markov Decision Processes: Research Frontiers and Outlooks
Abstract: Learning control of dynamical systems based on Markov decision processes (MDPs) has become, in recent years, an interdisciplinary research area spanning machine learning, control theory, and operations research. Its main objective is to realize data-driven, multi-stage optimal control of dynamical systems whose models are complex or uncertain. This paper presents a comprehensive survey of the theory, algorithms, and applications of MDP-based learning control of dynamical systems. Emphasis is placed on recent advances in the theory and methods of reinforcement learning (RL) and adaptive/approximate dynamic programming (ADP), including temporal-difference learning theory, value function approximation for MDPs with continuous state and action spaces, direct policy search, approximate policy iteration, and adaptive critic designs. Applications and trends for future research and development in related fields are also analyzed and discussed.