Abstract: In multi-agent communication and cooperation tasks under partial observability, most existing methods use only the hidden-layer information of the network at the current time step, which restricts the source of communication messages and cannot guarantee their adequacy and effectiveness. This paper studies how to train a group of decentralized policies from a shared team reward and how to improve the cooperative performance of these policies, and proposes a multi-agent attentional intention communication (MAAIC) algorithm, which adds an intention-information module to enlarge the source of communication messages and improves the communication mode. Considering that an agent's historically best-performing networks can reveal its strategic intention, these networks are kept as intention networks, the intention information extracted from them is retained as a vector in chronological order, and an attention mechanism combines this vector with the current observation-history encoding to infer more effective information as input for decision making. The effectiveness of the algorithm is verified by comparative experiments on the StarCraft Multi-Agent Challenge (SMAC).
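As a concrete illustration of the mechanism described above, the following is a minimal PyTorch-style sketch of how an agent's current observation-history encoding might attend over its chronologically ordered intention vectors. The class name IntentionAttention, the tensor shapes, and the mapping of Attention_dim1/Attention_dim2 onto per-head and total attention widths are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code) of the attentional intention
# communication module: the current GRU hidden state queries the intention vectors
# produced by the historically best-performing networks, kept in chronological order.
import torch
import torch.nn as nn

class IntentionAttention(nn.Module):
    def __init__(self, rnn_hidden_dim=64, attn_dim=64, n_heads=8):
        super().__init__()
        embed_dim = attn_dim * n_heads  # e.g. 64 * 8, cf. Attention_dim2 in Table 4
        self.query = nn.Linear(rnn_hidden_dim, embed_dim)
        self.key = nn.Linear(rnn_hidden_dim, embed_dim)
        self.value = nn.Linear(rnn_hidden_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, h_current, intentions):
        # h_current:  (batch, rnn_hidden_dim) current observation-history encoding
        # intentions: (batch, n_intention, rnn_hidden_dim) intention vectors in
        #             chronological order, one per retained historically best network
        q = self.query(h_current).unsqueeze(1)   # (batch, 1, embed_dim)
        k, v = self.key(intentions), self.value(intentions)
        msg, _ = self.attn(q, k, v)              # attended intention message
        # Concatenate the message with the current encoding as decision input.
        return torch.cat([h_current, msg.squeeze(1)], dim=-1)
```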
Key words:
- Multi-Agent
- Reinforcement Learning
- Intention Communication
- Attention Mechanism
Table 1 Experimental scenarios under SMAC
| Scenario | Ally units | Enemy units | Type |
|---|---|---|---|
| 5m_vs_6m | 5 Marines | 6 Marines | homogeneous, asymmetric |
| 3s_vs_5z | 3 Stalkers | 5 Zealots | micro-trick: kiting |
| 2s_vs_1sc | 2 Stalkers | 1 Spine Crawler | micro-trick: alternating fire |
| 3s5z | 3 Stalkers & 5 Zealots | 3 Stalkers & 5 Zealots | heterogeneous, symmetric |
| 6h_vs_8z | 6 Hydralisks | 8 Zealots | micro-trick: focus fire |

Table 2 Maximum median performance (%) of the algorithms tested
| Scenario | MAAIC-VDN | VDN | IQL | MAAIC-QMIX | QMIX | Heuristic | MAAIC-QTRAN | QTRAN |
|---|---|---|---|---|---|---|---|---|
| 2s_vs_1sc | 100 | 100 | 100 | 100 | 100 | 0 | 100 | 100 |
| 3s5z | 90 | 87 | 9 | 97 | 91 | 42 | 31 | 20 |
| 5m_vs_6m | 87 | 78 | 59 | 74 | 75 | 0 | 67 | 58 |
| 3s_vs_5z | 98 | 73 | 46 | 98 | 97 | 0 | 97 | 15 |
| 6h_vs_8z | 55 | 0 | 0 | 31 | 3 | 0 | 22 | 0 |

Table 3 GPU memory cost for different numbers of intention units based on the MAAIC-VDN algorithm
| Scenario | Five Intention Nets | Three Intention Nets | One Intention Net | TarMAC | VDN |
|---|---|---|---|---|---|
| 2s_vs_1sc | 1560M | 1510M | 1470M | 1120M | 680M |
| 5m_vs_6m | 1510M | 1500M | 1500M | 1150M | 680M |
| 3s_vs_5z | 2120M | 2090M | 2100M | 1480M | 730M |

Table 4 Network parameters of the multi-intention communication learning algorithm
| Parameter | Value | Description |
|---|---|---|
| Rnn_hidden_dim | 64 | Fully connected encoding dimension of the local observation and hidden-layer dimension of the recurrent network |
| Attention_dim1 | 64 | Attention encoding dimension of the intention information |
| Attention_dim2 | 64*8 | Encoding dimension of the multi-head attention mechanism |
| N_intention | 3 | Number of intention networks |

Table 5 Training parameters of the multi-intention communication learning algorithm
| Parameter | Value | Description |
|---|---|---|
| Lr | 0.0005 | Learning rate for the loss function |
| Optim_eps | 0.00001 | Term added to the RMSProp denominator for numerical stability |
| Epsilon | 1 | Initial exploration probability |
| Min_epsilon | 0.05 | Minimum exploration probability |
| Anneal_steps | 50000 | Number of steps over which the exploration probability is annealed |
| Epsilon_anneal_scale | step | Annealing schedule of the exploration probability (per step) |
| N_epoch | 20000 | Total number of training epochs |
| N_episodes | 1 | Number of episodes sampled per epoch |
| Evaluate_cycle | 100 | Evaluation interval |
| Evaluate_epoch | 20 | Number of evaluation episodes |
| Batch_size | 32 | Batch size for training |
| Buffer_size | 5000 | Replay buffer size |
| Target_update_cycle | 200 | Target network update interval |
| Grad_norm_clip | 10 | Gradient clipping threshold, to prevent gradient explosion |

Table 6 Training parameters of the multi-intention communication learning algorithm
| Parameter | Value | Description |
|---|---|---|
| Training step | 3000000 | Maximum number of training steps |
| Learning rate | 0.0005 | Learning rate of the Adam optimizer |
| Replay buffer size | 600000 | Maximum number of stored samples |
| Minibatch size | 32 | Number of samples used per parameter update |
| Anneal_steps | 500000 | Number of steps over which the exploration probability is annealed |
| $\alpha$ | 1 | Coefficient of the extrinsic reward |
| $\beta$ | 0.5 | Coefficient of the intrinsic reward |
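For reproduction purposes, the sketch below collects the hyperparameters of Tables 4 and 5 into a Python configuration dictionary and reconstructs the step-wise epsilon annealing implied by Epsilon, Min_epsilon and Anneal_steps. The keys mirror the table entries, but the linear schedule itself is an assumption rather than the authors' exact code.

```python
# Hyperparameters transcribed from Tables 4 and 5; the annealing helper below is an
# assumed linear, per-step schedule inferred from Epsilon / Min_epsilon / Anneal_steps.
config = {
    # Network parameters (Table 4)
    "rnn_hidden_dim": 64,      # local-observation encoding and GRU hidden size
    "attention_dim1": 64,      # attention encoding size of intention information
    "attention_dim2": 64 * 8,  # multi-head attention encoding size
    "n_intention": 3,          # number of intention networks per agent
    # Training parameters (Table 5)
    "lr": 0.0005,
    "optim_eps": 1e-5,         # added to the RMSProp denominator for stability
    "epsilon": 1.0,
    "min_epsilon": 0.05,
    "anneal_steps": 50000,
    "n_epoch": 20000,
    "batch_size": 32,
    "buffer_size": 5000,
    "target_update_cycle": 200,
    "grad_norm_clip": 10,
}

def epsilon_at(step, cfg=config):
    """Linearly anneal the exploration probability from epsilon to min_epsilon."""
    frac = min(step / cfg["anneal_steps"], 1.0)
    return cfg["epsilon"] + frac * (cfg["min_epsilon"] - cfg["epsilon"])
```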
