Model-free Policy Gradient-based Reinforcement Learning Algorithms for Optimal Control of Unknown Stochastic Systems
-
Abstract: This paper investigates the optimal control problem for a class of Markov stochastic jump systems (MSJSs) with unknown dynamics via two novel model-free policy gradient (PG)-based reinforcement learning (RL) algorithms. First, for MSJSs with partially unknown model information, an analytical form of the model-free PG is derived from sampled system data and the solutions of coupled Lyapunov equations, and a partially model-free PG-based RL optimal control algorithm is proposed that directly minimizes a predefined performance index. Since the data needed both to solve the coupled Lyapunov equations and to compute the PG can be extracted from the same trajectory of sampled system data, no additional sampling is required and the sampling complexity of the algorithm is significantly reduced. Furthermore, to completely remove the dependence on model information of the MSJS, the PG is estimated through random perturbations of the feedback gain, and a completely model-free PG-based RL algorithm is proposed that achieves optimal control of MSJSs with completely unknown dynamics. Finally, simulation results demonstrate the efficiency and superiority of the two proposed model-free PG-based RL optimal control algorithms.
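The random-perturbation idea behind the completely model-free algorithm can be pictured with a short sketch. The snippet below is only an illustration of a standard zeroth-order (smoothed-cost) gradient estimate, not the paper's actual routine: `closed_loop_cost` is an assumed black box that returns the sampled closed-loop cost of running the system under a given feedback gain, and the smoothing radius, rollout count, and function names are illustrative.

```python
import numpy as np

def zeroth_order_pg(closed_loop_cost, K, radius=0.05, n_rollouts=10, rng=None):
    """Estimate the policy gradient at gain K from cost evaluations only,
    by averaging costs of randomly perturbed gains (smoothed-cost estimator).
    `closed_loop_cost(K)` is an assumed black box returning a sampled scalar cost."""
    rng = np.random.default_rng() if rng is None else rng
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(n_rollouts):
        # Perturbation drawn uniformly on the sphere of radius `radius`
        U = rng.standard_normal(K.shape)
        U *= radius / np.linalg.norm(U)
        # One-point smoothed-gradient estimate: (d / r^2) * J(K + U) * U
        grad += (d / (n_rollouts * radius ** 2)) * closed_loop_cost(K + U) * U
    return grad

# Hypothetical usage: gradient descent on the gain from some stabilizing K0.
# K = K0
# for _ in range(num_iterations):
#     K -= step_size * zeroth_order_pg(closed_loop_cost, K)
```

Each gradient estimate here requires fresh rollouts under perturbed gains, which is why such estimators trade model knowledge for sampling cost.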
-
Key words:
- Stochastic systems
- optimal control
- unknown dynamics
- policy gradient
- reinforcement learning
-
Table 1 Notation

| Symbol | Description |
|---|---|
| ${\bf{R}}^{m}$, ${\bf{R}}^{m\times n}$ | Real $m$-dimensional vectors, real $m\times n$ matrices |
| ${\rm{E}}\{\cdot\}$ | Mathematical expectation |
| $\otimes$ | Kronecker product |
| ${\bf{Z}}$, ${\bf{Z}}_{\ge 0}$ | Set of integers, set of non-negative integers |
| $X>{\bf{0}}$, $X\geq{\bf{0}}$ | Positive definite matrix $X$, positive semi-definite matrix $X$ |
| $I_{n}$, ${\bf{0}}$ | Identity matrix, zero matrix |
| ${\rm{vec}}(X)$ | $[x_{1}^{\top},\; \cdots,\; x_{m}^{\top}]^{\top}\in{\bf{R}}^{nm}$ for $X\in{\bf{R}}^{n\times m}$ with columns $x_{1},\;\cdots,\;x_{m}$ |
| ${\rm{vecs}}(X)$ | $[x_{11},\; \cdots,\; x_{1n},\; x_{22},\; \cdots,\; x_{2n},\; \cdots,\; x_{nn}]^{\top}\in{\bf{R}}^{\frac{n(n+1)}{2}}$ for $X\in{\bf{R}}^{n\times n}$ |
| $\bar{x}$ | $[x_{1}^{2},\; \cdots,\; 2x_{1}x_{n},\; x_{2}^{2},\; \cdots,\; 2x_{2}x_{n},\; \cdots,\; x_{n}^{2}]^{\top}\in{\bf{R}}^{\frac{n(n+1)}{2}}$ for $x=[x_{1},\; \cdots,\; x_{n}]^{\top}\in{\bf{R}}^{n}$ |
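As a concrete illustration of the ${\rm{vecs}}(\cdot)$ and $\bar{x}$ operators defined above, the following minimal NumPy sketch (helper names are illustrative, not from the paper) builds both vectors and checks the identity $x^{\top}Xx=\bar{x}^{\top}{\rm{vecs}}(X)$ for a symmetric $X$, the property commonly exploited when fitting quadratic value functions from sampled data.

```python
import numpy as np

def vecs(X):
    """Stack the upper-triangular entries of a square matrix row by row:
    [x_11, ..., x_1n, x_22, ..., x_2n, ..., x_nn]^T (illustrative helper)."""
    n = X.shape[0]
    return np.concatenate([X[i, i:] for i in range(n)])

def bar(x):
    """Quadratic feature vector [x_1^2, ..., 2*x_1*x_n, x_2^2, ..., x_n^2]^T,
    i.e. vecs(x x^T) with the off-diagonal products doubled."""
    n = x.shape[0]
    return np.concatenate([
        np.concatenate(([x[i] ** 2], 2.0 * x[i] * x[i + 1:])) for i in range(n)
    ])

# Check x^T X x == bar(x)^T vecs(X) for a random symmetric X.
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
X = (A + A.T) / 2.0          # symmetric test matrix
x = rng.standard_normal(n)
assert np.isclose(x @ X @ x, bar(x) @ vecs(X))
```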
Table 2 Comparison of Algorithm 1 and Algorithm 2 in sampling efficiency

| | Algorithm 1 | Algorithm 2 |
|---|---|---|
| Number of sampled trajectories $N_{s}$ | $10$ | $10$ |
| Simulation time per trajectory $T_{t}$ (s) | $10$ | $10$ |
| Sampling period $T_{s}$ (s) | $0.001$ | $0.001$ |
| Total simulation time (s) | $N_{s}T_{t}=100$ | $N_{s}T_{t}\frac{2T_{t}}{T_{s}}=2\;000\;000$ |
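Substituting the tabulated values into the total-simulation-time expressions makes the gap in sampling cost explicit, consistent with the abstract's point that Algorithm 1 reuses one batch of sampled trajectories while the completely model-free Algorithm 2 must roll out additional trajectories under randomly perturbed gains to estimate the gradient:

$$N_{s}T_{t}=10\times 10\;{\rm{s}}=100\;{\rm{s}},\qquad N_{s}T_{t}\,\frac{2T_{t}}{T_{s}}=10\times 10\;{\rm{s}}\times\frac{2\times 10\;{\rm{s}}}{0.001\;{\rm{s}}}=2\;000\;000\;{\rm{s}}.$$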