
面向多智能体协作的注意力意图与交流学习方法

俞文武 杨晓亚 李海昌 王瑞 胡晓惠

俞文武, 杨晓亚, 李海昌, 王瑞, 胡晓惠. 面向多智能体协作的注意力意图与交流学习方法. 自动化学报, 2021, 47(x): 1−16 doi: 10.16383/j.aas.c210430
Citation: Yu Wen-Wu, Yang Xiao-Ya, Li Hai-Chang, Wang Rui, Hu Xiao-Hui. Attentional intention and communication for multi-agent learning. Acta Automatica Sinica, 2021, 47(x): 1−16 doi: 10.16383/j.aas.c210430

面向多智能体协作的注意力意图与交流学习方法

doi: 10.16383/j.aas.c210430
基金项目: 国家重点研发计划(2019YFB1405100), 国家自然科学基金(61802380, 61802016)资助
详细信息
    作者简介:

    俞文武:中国科学院软件研究所博士研究生. 2016年获得湖南大学学士学位. 主要研究方向为深度强化学习. E-mail: wenwu2016@iscas.ac.cn

    杨晓亚:中国科学院软件研究所硕士研究生. 2017年获得吉林大学计算机科学与技术学士学位, 主要研究方向为强化学习. E-mail: 642655800@qq.com

    李海昌:中国科学院软件研究所副研究员. 2016年获得中国科学院自动化研究所博士学位. 主要研究方向为计算机视觉, 模式识别和深度学习. E-mail: haichang@iscas.ac.cn

    王瑞:中国科学院软件研究所工程师. 2012年获得山东大学计算机软件与理论硕士学位. 主要研究方向为智能信息处理. E-mail: wangrui@iscas.ac.cn

    胡晓惠:中国科学院软件研究所研究员. 2003年获得北京航空航天大学博士学位. 主要研究方向为智能信息处理与系统集成. E-mail: hxh@iscas.ac.cn

Attentional Intention and Communication for Multi-Agent Learning

Funds: Supported by the National Key Research and Development Program of China (2019YFB1405100) and the National Natural Science Foundation of China (61802380, 61802016)
More Information
    Author Bio:

    YU Wen-Wu Ph. D. candidate at the Institute of Software, Chinese Academy of Sciences. He received his bachelor's degree from Hunan University in 2016. His research interest is deep reinforcement learning

    YANG Xiao-Ya Master's degree candidate at the Institute of Software, Chinese Academy of Sciences. She received her bachelor's degree in computer science and technology from Jilin University in 2017. Her research interest is reinforcement learning

    LI Hai-Chang Associate professor at the Institute of Software, Chinese Academy of Sciences. He received his Ph. D. degree from the Institute of Automation, Chinese Academy of Sciences in 2016. His research interest covers computer vision, pattern recognition and deep learning

    WANG Rui Engineer at the Institute of Software, Chinese Academy of Sciences. She received her master's degree from Shandong University in 2012. Her research interest covers intelligent information processing

    HU Xiao-Hui Professor at the Institute of Software, Chinese Academy of Sciences. He received his Ph. D. degree from Beihang University in 2003. His research interest covers intelligent information processing and system integration

  • 摘要: 对于部分可观测环境下的多智能体交流协作任务, 现有工作大多只利用了当前时刻的网络隐藏层信息, 限制了信息的来源. 本文研究如何使用团队奖励训练一组独立的策略以及如何提升这组独立策略的协同表现, 提出了多智能体注意力意图交流算法, 增加了意图信息模块来扩大交流信息的来源, 并且改善了交流模式. 本文将智能体历史上表现最优的网络作为意图网络, 且从中提取策略意图信息, 按时间顺序保留成一个向量, 最后结合注意力机制推断出更为有效的交流信息. 本文在星际争霸环境上通过实验对比分析, 验证了算法的有效性.

    Abstract: For multi-agent communication and cooperation tasks in partially observable environments, most existing work exploits only the hidden-layer information of the network at the current time step, which limits the sources of information. This paper studies how to train a set of independent policies with a team reward and how to improve the cooperative performance of these policies, and proposes a multi-agent attentional intention and communication (MAAIC) algorithm: an intention-information module broadens the sources of the communicated information, and the communication pattern is improved. The historically best-performing network of each agent is taken as its intention network, policy-intention information is extracted from it and kept, in temporal order, as a vector, and an attention mechanism then infers more effective communication messages. Comparative experiments on the StarCraft environment verify the effectiveness of the algorithm.
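    The following is a minimal PyTorch-style sketch of the idea described in the abstract: keeping the historically best-performing networks as intention networks and stacking their features in temporal order. The class IntentionBank, its methods, and the convention that the agent network returns (Q-values, hidden state) are our own illustrative assumptions, not the authors' code; only max_nets = 3 (N_intention in Table 4) comes from the paper.

```python
import copy
import torch

class IntentionBank:
    """Hypothetical helper: keeps snapshots of the historically best agent networks."""
    def __init__(self, max_nets=3):                  # N_intention = 3 (Table 4)
        self.max_nets = max_nets
        self.nets, self.scores = [], []

    def maybe_add(self, agent_net, eval_score):
        # Snapshot the current network if the bank is not full or the new
        # evaluation score beats the worst stored one.
        if len(self.nets) < self.max_nets or eval_score > min(self.scores):
            self.nets.append(copy.deepcopy(agent_net).eval())
            self.scores.append(eval_score)
            if len(self.nets) > self.max_nets:       # drop the weakest snapshot
                worst = self.scores.index(min(self.scores))
                self.nets.pop(worst)
                self.scores.pop(worst)

    def intention_features(self, obs, hidden):
        # Run every stored snapshot on the current observation and stack the
        # resulting hidden features in order -> [n_intention, feature_dim].
        # Assumes the agent network returns (Q-values, hidden state).
        with torch.no_grad():
            feats = [net(obs, hidden)[1] for net in self.nets]
        return torch.stack(feats, dim=0)
```

    In this sketch a snapshot would naturally be taken after each evaluation cycle (Evaluate_cycle = 100 in Table 5); the exact snapshot criterion used in the paper may differ.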
  • 图  1  多意图交流算法整体框架图

    Fig.  1  Overall framework of multi-agent intention and communication algorithm

    图  2  对意图网络进行自注意力信息的提取. 其中Linear(Q)表示Q为一个线性层, K与V也是线性层

    Fig.  2  Extracting self-attention information from the intention networks, where Linear(Q) means that Q is a linear layer; K and V are linear layers as well
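    Since the caption only states that Q, K and V are produced by linear layers, the block below is a hedged sketch of a standard scaled dot-product self-attention over the stacked intention features; the dimension 64 follows Attention_dim1 in Table 4, while the class and argument names are our assumptions.

```python
import math
import torch
import torch.nn as nn

class IntentionSelfAttention(nn.Module):
    """Scaled dot-product self-attention over the stacked intention features."""
    def __init__(self, feat_dim=64, attn_dim=64):    # Attention_dim1 = 64 (Table 4)
        super().__init__()
        self.q = nn.Linear(feat_dim, attn_dim)       # Linear(Q)
        self.k = nn.Linear(feat_dim, attn_dim)       # Linear(K)
        self.v = nn.Linear(feat_dim, attn_dim)       # Linear(V)
        self.scale = math.sqrt(attn_dim)

    def forward(self, intention_feats):              # [batch, n_intention, feat_dim]
        q = self.q(intention_feats)
        k = self.k(intention_feats)
        v = self.v(intention_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v                              # [batch, n_intention, attn_dim]
```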

    图  3  集中式训练分布式执行的多智能体同环境交互图

    Fig.  3  Multi-agent interaction with the environment under CTDE (centralized training with decentralized execution)
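    As a reminder of what the CTDE setting in Fig. 3 implies, here is a hedged sketch of the interaction loop; select_action and buffer.add are illustrative names, while get_obs, step and get_state follow the SMAC environment interface. Each agent acts only on its local observation, and the global state plus the shared team reward are used only during centralized training.

```python
def run_episode(env, agents, buffer):
    """One episode under CTDE: decentralized acting, centralized learning signal."""
    env.reset()
    terminated = False
    while not terminated:
        obs = env.get_obs()                          # per-agent local observations
        actions = [agent.select_action(o) for agent, o in zip(agents, obs)]
        reward, terminated, _ = env.step(actions)    # single shared team reward
        buffer.add(obs, actions, reward, env.get_state())
    # Training then pushes the team reward through a VDN/QMIX/QTRAN-style mixer,
    # while each executed policy still conditions only on its own observation.
```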

    图  4  交流通道使用的多头注意力模型

    Fig.  4  Multi-head attention model used in communication channels
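    A hedged sketch of such a communication channel, built on PyTorch's nn.MultiheadAttention with 8 heads of 64 dimensions each (64*8 = 512, matching Attention_dim2 in Table 4); the input/output projections and the exact wiring are our assumptions.

```python
import torch.nn as nn

class CommChannel(nn.Module):
    """Agents exchange messages by attending over all agents' messages."""
    def __init__(self, msg_dim=64, n_heads=8):
        super().__init__()
        embed_dim = msg_dim * n_heads                        # 64 * 8 = 512 (Table 4)
        self.proj_in = nn.Linear(msg_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.proj_out = nn.Linear(embed_dim, msg_dim)

    def forward(self, messages):                             # [batch, n_agents, msg_dim]
        x = self.proj_in(messages)
        out, _ = self.attn(x, x, x)                          # each agent attends to all agents
        return self.proj_out(out)                            # aggregated message per agent
```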

    图  5  多意图交流学习算法MAAIC-VDN在SMAC上的实验结果

    Fig.  5  Experimental results of MAAIC-VDN algorithm on SMAC

    图  6  多智能体注意力意图交流学习算法MAAIC-QMIX在SMAC上的实验结果

    Fig.  6  Experimental results of MAAIC-QMIX algorithm on SMAC

    图  7  多智能体注意力意图交流学习算法MAAIC-QTRAN在SMAC上的实验结果

    Fig.  7  Experimental results of MAAIC-QTRAN algorithm on SMAC

    图  8  交流结构消融性实验结果

    Fig.  8  Experimental ablation results of the communication structure

    图  9  意图单元的数量消融性实验结果

    Fig.  9  Experimental ablation results of the number of intention units

    图  10  历史最优网络与历史最近邻网络作为意图网络消融性实验结果

    Fig.  10  Experimental ablation results of MAAIC with BQNet and NQNet

    图  11  基于算法MAAIC-VDN不同意图单元数的时间开销

    Fig.  11  Time cost for different number of intention units based on MAAIC-VDN algorithm

    图  12  内在意图奖励实验结果

    Fig.  12  Experimental results of intrinsic intention rewards

    表  1  SMAC实验场景

    Table  1  Experimental scenarios under SMAC

    Scenario | Allied units | Enemy units | Type
    5m_vs_6m | 5 Marines | 6 Marines | homogeneous & asymmetric
    3s_vs_5z | 3 Stalkers | 5 Zealots | micro-trick: kiting
    2s_vs_1sc | 2 Stalkers | 1 Spine Crawler | micro-trick: alternating fire
    3s5z | 3 Stalkers & 5 Zealots | 3 Stalkers & 5 Zealots | heterogeneous & symmetric
    6h_vs_8z | 6 Hydralisks | 8 Zealots | micro-trick: focus fire
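    Each scenario in Table 1 is a standard SMAC map selected by its map name. As a short usage sketch (assuming the smac package from the SMAC benchmark and a local StarCraft II installation):

```python
from smac.env import StarCraft2Env

env = StarCraft2Env(map_name="5m_vs_6m")      # any map name from Table 1
info = env.get_env_info()
print(info["n_agents"], info["n_actions"], info["episode_limit"])
```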

    表  2  所测试算法的最大中值实验结果

    Table  2  Maximum median performance (%) of the algorithms tested

    Scenario | MAAIC-VDN | VDN | IQL | MAAIC-QMIX | QMIX | Heuristic | MAAIC-QTRAN | QTRAN
    2s_vs_1sc | 100 | 100 | 100 | 100 | 100 | 0 | 100 | 100
    3s5z | 90 | 87 | 9 | 97 | 91 | 42 | 31 | 20
    5m_vs_6m | 87 | 78 | 59 | 74 | 75 | 0 | 67 | 58
    3s_vs_5z | 98 | 73 | 46 | 98 | 97 | 0 | 97 | 15
    6h_vs_8z | 55 | 0 | 0 | 31 | 3 | 0 | 22 | 0

    表  3  基于算法MAAIC-VDN不同意图单元数的GPU内存开销

    Table  3  GPU memory cost for different number of intention units based on MAAIC-VDN algorithm

    Scenario | Five Intention Nets | Three Intention Nets | One Intention Net | TarMAC | VDN
    2s_vs_1sc | 1560M | 1510M | 1470M | 1120M | 680M
    5m_vs_6m | 1510M | 1500M | 1500M | 1150M | 680M
    3s_vs_5z | 2120M | 2090M | 2100M | 1480M | 730M

    表  4  多意图交流学习算法网络参数

    Table  4  Network parameters of the multi-intention communication learning algorithm

    Parameter | Value | Description
    Rnn_hidden_dim | 64 | Dimension of the fully connected encoding of the local observation and of the recurrent network's hidden state
    Attention_dim1 | 64 | Attention encoding dimension of the intention information
    Attention_dim2 | 64*8 | Encoding dimension of the multi-head attention mechanism
    N_intention | 3 | Number of intention networks
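    A hedged sketch of how the sizes in Table 4 are typically wired in a VDN/QMIX-style recurrent agent; only the dimensions come from the table, the concrete layer layout (FC encoder + GRU cell + Q-value head) is our assumption based on common practice.

```python
import torch.nn as nn
import torch.nn.functional as F

class RNNAgent(nn.Module):
    """Per-agent network: 64-dim observation encoder and 64-dim GRU (Rnn_hidden_dim)."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)       # local-observation encoder
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)   # recurrent core
        self.fc2 = nn.Linear(hidden_dim, n_actions)     # per-action Q-values

    def forward(self, obs, h_prev):
        x = F.relu(self.fc1(obs))
        h = self.rnn(x, h_prev)
        return self.fc2(h), h                           # (Q-values, new hidden state)
```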

    表  5  多意图交流学习算法训练参数

    Table  5  Training parameters of the multi-intention communication learning algorithm

    Parameter | Value | Description
    Lr | 0.0005 | Learning rate for the loss function
    Optim_eps | 0.00001 | Epsilon added to the denominator in RMSProp for numerical stability
    Epsilon | 1 | Initial exploration probability
    Min_epsilon | 0.05 | Minimum exploration probability
    Anneal_steps | 50000 | Number of steps over which the exploration probability is annealed
    Epsilon_anneal_scale | step | Annealing schedule of the exploration probability (per step)
    N_epoch | 20000 | Total number of training epochs
    N_episodes | 1 | Number of episodes sampled per epoch
    Evaluate_cycle | 100 | Evaluation interval
    Evaluate_epoch | 20 | Number of evaluation episodes
    Batch_size | 32 | Training batch size
    Buffer_size | 5000 | Replay buffer size
    Target_update_cycle | 200 | Target network update interval
    Grad_norm_clip | 10 | Gradient clipping threshold to prevent exploding gradients
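    With Epsilon = 1, Min_epsilon = 0.05 and Anneal_steps = 50000, a per-step linear anneal (our reading of Epsilon_anneal_scale = step) works out as follows; the helper name epsilon_at is illustrative.

```python
def epsilon_at(step, eps_start=1.0, eps_min=0.05, anneal_steps=50000):
    """Linear per-step annealing of the exploration probability (Table 5 values)."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start - frac * (eps_start - eps_min)

# epsilon_at(0) == 1.0, epsilon_at(25000) == 0.525, epsilon_at(50000) == 0.05
```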

    表  6  多意图交流学习算法训练参数

    Table  6  Training parameters of the multi-intention communication learning algorithm

    Parameter | Value | Description
    Training step | 3000000 | Maximum number of training steps
    Learning rate | 0.0005 | Learning rate for the Adam optimizer
    Replay buffer size | 600000 | Maximum number of stored samples
    Minibatch size | 32 | Number of samples used for each parameter update
    Anneal_steps | 500000 | Number of steps over which the exploration probability is annealed
    $\alpha$ | 1 | Coefficient of the extrinsic reward
    $\beta$ | 0.5 | Coefficient of the intrinsic reward
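    The combination formula is not reproduced in this table; assuming the usual additive mixture, the training reward would be $r_t = \alpha\, r_t^{\rm ext} + \beta\, r_t^{\rm int}$, so with $\alpha = 1$ and $\beta = 0.5$ the intrinsic intention reward is added to the environment reward at half weight.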
出版历程 (Publication History)
  • 收稿日期 (Received):  2021-05-18
  • 录用日期 (Accepted):  2021-09-17
  • 网络出版日期 (Published online):  2021-10-20
