Attentional Intention and Communication for Multi-agent Learning

Yu Wen-Wu, Yang Xiao-Ya, Li Hai-Chang, Wang Rui, Hu Xiao-Hui

Citation: Yu Wen-Wu, Yang Xiao-Ya, Li Hai-Chang, Wang Rui, Hu Xiao-Hui. Attentional intention and communication for multi-agent learning. Acta Automatica Sinica, 2023, 49(11): 2311−2325 doi: 10.16383/j.aas.c210430

doi: 10.16383/j.aas.c210430
Funds: Supported by the National Key Research and Development Program of China (2019YFB1405100) and the National Natural Science Foundation of China (61802380, 61802016)
    Author Bio:

    YU Wen-Wu  Ph.D. candidate at the Institute of Software, Chinese Academy of Sciences. He received his bachelor's degree from Hunan University in 2016. His main research interest is deep reinforcement learning. E-mail: wenwu2016@iscas.ac.cn

    YANG Xiao-Ya  Master's student at the Institute of Software, Chinese Academy of Sciences. She received her bachelor's degree from Jilin University in 2017. Her main research interest is reinforcement learning. E-mail: yangxiaoya17@mails.ucas.ac.cn

    LI Hai-Chang  Associate professor at the Institute of Software, Chinese Academy of Sciences. He received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2016. His research interests cover computer vision, pattern recognition, and deep learning. Corresponding author of this paper. E-mail: haichang@iscas.ac.cn

    WANG Rui  Engineer at the Institute of Software, Chinese Academy of Sciences. She received her master's degree from Shandong University in 2012. Her main research interest is intelligent information processing. E-mail: wangrui@iscas.ac.cn

    HU Xiao-Hui  Professor at the Institute of Software, Chinese Academy of Sciences. He received his Ph.D. degree from Beihang University in 2003. His research interests cover intelligent information processing and system integration. E-mail: hxh@iscas.ac.cn

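  • Abstract: For multi-agent communication and cooperation tasks in partially observable environments, most existing work uses only the network's hidden-layer information at the current time step, which limits the sources of information. This paper studies how to train a set of independent policies with a team reward and how to improve the cooperative performance of those policies. It proposes the multi-agent attentional intention and communication (MAAIC) algorithm, which adds an intention information module to broaden the sources of communicated information and improves the communication mode. The networks that performed best in an agent's history are used as intention networks; policy intention information is extracted from them and kept in temporal order as a vector, and an attention mechanism is then used to infer more effective communication messages. Comparative experiments in the StarCraft environment verify the effectiveness of the algorithm.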
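The intention mechanism summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `IntentionPool`, its `update` and `ordered` methods, and the scalar score used to rank snapshots are hypothetical names, and a single-query scaled dot-product attention stands in for whatever attention form the paper actually uses.

```python
import math
from dataclasses import dataclass, field


@dataclass
class IntentionPool:
    """Keeps the k best-scoring network snapshots (hypothetical helper)."""
    k: int
    entries: list = field(default_factory=list)  # (step, score, intention_vector)

    def update(self, step, score, vector):
        self.entries.append((step, score, list(vector)))
        # Keep only the k highest-scoring snapshots seen so far.
        self.entries.sort(key=lambda e: e[1], reverse=True)
        del self.entries[self.k:]

    def ordered(self):
        # Return the retained intention vectors in temporal order.
        return [vec for _, _, vec in sorted(self.entries, key=lambda e: e[0])]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, vectors):
    """Scaled dot-product attention of one query over the intention vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, vec)) / scale for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    message = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
    return message, weights


# Hypothetical snapshots: (training step, evaluation score, intention vector).
pool = IntentionPool(k=3)
for step, score, vec in [(100, 0.2, [1.0, 0.0]), (200, 0.9, [0.0, 1.0]),
                         (300, 0.5, [1.0, 1.0]), (400, 0.7, [0.5, 0.5])]:
    pool.update(step, score, vec)

intentions = pool.ordered()  # snapshots from steps 200, 300, 400
message, weights = attend([0.0, 1.0], intentions)
```

The lowest-scoring snapshot (step 100) is dropped, and the remaining vectors are attended over in the order they were produced.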
  • Fig. 1  Overall framework of MAAIC algorithm

    Fig. 2  Extracting self-attention information from intention network

    Fig. 3  Multi-agent interaction with environment under centralized training and decentralized execution

    Fig. 4  Multi-head attention model used in communication channels

    Fig. 5  Experimental results of MAAIC-VDN algorithm on SMAC

    Fig. 6  Experimental results of MAAIC-QMIX algorithm on SMAC

    Fig. 7  Experimental results of MAAIC-QTRAN algorithm on SMAC

    Fig. 8  Experimental ablation results of the communication structure

    Fig. 9  Experimental ablation results of the number of intention networks

    Fig. 10  Time cost for different numbers of intention networks based on MAAIC-VDN algorithm

    Fig. 11  Experimental ablation results of MAAIC with the best Q-network and the nearest Q-network

    Fig. 12  Experimental results of intrinsic intention rewards

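    Table 1  Experimental scenarios under SMAC

    Scenario     Ally units               Enemy units              Type
    5m_vs_6m     5 Marines                6 Marines                Homogeneous but asymmetric
    3s_vs_5z     3 Stalkers               5 Zealots                Micro-trick: kiting
    2s_vs_1sc    2 Stalkers               1 Spine Crawler          Micro-trick: alternating fire
    3s5z         3 Stalkers & 5 Zealots   3 Stalkers & 5 Zealots   Heterogeneous and symmetric
    6h_vs_8z     6 Hydralisks             8 Zealots                Micro-trick: focus fire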

    Table 2  Maximum median performance of the algorithms tested (%)

    Scenario     MAAIC-VDN   VDN   IQL   MAAIC-QMIX   QMIX   Heuristic   MAAIC-QTRAN   QTRAN
    2s_vs_1sc    100         100   100   100          100    0           100           100
    3s5z         90          87    9     97           91     42          31            20
    5m_vs_6m     87          78    59    74           75     0           67            58
    3s_vs_5z     98          73    46    98           97     0           97            15
    6h_vs_8z     55          0     0     31           3      0           22            0
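A small sketch of how a "maximum median" figure such as those in Table 2 could be computed. It assumes the metric is the median test win rate across independent runs at each evaluation point, maximized over evaluation points; the page does not spell out the definition, so this reading is an assumption. The function name `max_median_win_rate` and the win-rate curves are hypothetical and are not data from the paper.

```python
from statistics import median


def max_median_win_rate(runs):
    """runs: one list per seed, each holding test win rates (%) per evaluation point."""
    n_points = min(len(r) for r in runs)
    # Median across seeds at each evaluation point, then the best point overall.
    medians = [median(r[t] for r in runs) for t in range(n_points)]
    return max(medians)


# Hypothetical win-rate curves for three seeds (illustrative only).
example_runs = [
    [0, 40, 80, 90],
    [0, 50, 70, 95],
    [0, 30, 85, 60],
]
best = max_median_win_rate(example_runs)
```

Taking the median before the maximum keeps a single lucky seed from setting the reported number.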

    Table 3  GPU memory cost for different numbers of intention networks based on MAAIC-VDN algorithm (MB)

    Scenario     5 intention networks   3 intention networks   1 intention network   VDN with TarMAC   VDN
    2s_vs_1sc    1560                   1510                   1470                  1120              680
    5m_vs_6m     1510                   1500                   1500                  1150              680
    3s_vs_5z     2120                   2090                   2100                  1480              730

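    Table B1  Network parameters of MAAIC algorithm

    Parameter         Value           Description
    rnn_hidden_dim    64              Dimension of the fully connected feature encoding of local observations and of the recurrent network's hidden layer
    attention_dim1    64              Attention encoding dimension of the intention information
    attention_dim2    $64\times 8$    Encoding dimension of the multi-head attention mechanism
    n_intention       3               Number of intention networks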
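To make the dimensions in Table B1 concrete, here is a hedged NumPy sketch of multi-head attention across agents. It assumes that $64\times 8$ means 8 heads of width 64 and that each agent's 64-dimensional hidden state is the input to the communication channel; neither detail is confirmed by the table. The function name, weight shapes, and random inputs are illustrative.

```python
import numpy as np

RNN_HIDDEN_DIM = 64  # rnn_hidden_dim from Table B1
HEAD_DIM = 64        # assumed per-head width (reading 64 x 8 as 8 heads of 64)
N_HEADS = 8          # assumed number of heads


def multi_head_attention(hidden, w_q, w_k, w_v):
    """hidden: (n_agents, RNN_HIDDEN_DIM); w_*: (N_HEADS, RNN_HIDDEN_DIM, HEAD_DIM)."""
    q = np.einsum("ad,hde->hae", hidden, w_q)  # (heads, agents, head_dim)
    k = np.einsum("ad,hde->hae", hidden, w_k)
    v = np.einsum("ad,hde->hae", hidden, w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)  # (heads, agents, agents)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v                                       # (heads, agents, head_dim)
    # Concatenate heads: one message of width N_HEADS * HEAD_DIM per agent.
    return out.transpose(1, 0, 2).reshape(hidden.shape[0], N_HEADS * HEAD_DIM)


rng = np.random.default_rng(0)
n_agents = 5
hidden = rng.standard_normal((n_agents, RNN_HIDDEN_DIM))
w_q, w_k, w_v = (rng.standard_normal((N_HEADS, RNN_HIDDEN_DIM, HEAD_DIM)) * 0.1
                 for _ in range(3))
messages = multi_head_attention(hidden, w_q, w_k, w_v)  # shape (5, 512)
```

Under these assumptions each agent receives a 512-dimensional message, which matches the $64\times 8$ entry for attention_dim2.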

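    Table B2  Training parameters of MAAIC algorithm in SMAC

    Parameter               Value      Description
    Lr                      0.0005     Learning rate used to minimize the loss
    Optim_eps               0.00001    Term RMSProp adds to the denominator to improve numerical stability
    Epsilon                 1          Exploration probability
    Min_epsilon             0.05       Minimum exploration probability
    Anneal_steps            50000      Number of annealing steps
    Epsilon_anneal_scale    step       Annealing mode of the exploration probability
    N_epoch                 20000      Total number of training epochs
    N_episodes              1          Number of game episodes per epoch
    Evaluate_cycle          100        Evaluation interval
    Evaluate_epoch          20         Number of evaluation runs
    Batch_size              32         Training batch size
    Buffer_size             5000       Replay buffer size
    Target_update_cycle     200        Target network update interval
    Grad_norm_clip          10         Gradient clipping, to prevent exploding gradients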
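The exploration settings in Table B2 (Epsilon, Min_epsilon, Anneal_steps, and a per-step annealing mode) suggest a schedule like the one sketched below. The linear shape of the decay is an assumption: the table gives only the start value, the floor, and the number of steps. The function name `epsilon_at` is hypothetical.

```python
EPSILON = 1.0          # Epsilon from Table B2
MIN_EPSILON = 0.05     # Min_epsilon from Table B2
ANNEAL_STEPS = 50_000  # Anneal_steps from Table B2


def epsilon_at(step):
    """Exploration probability after `step` environment steps (linear decay assumed)."""
    decrement = (EPSILON - MIN_EPSILON) / ANNEAL_STEPS
    # Decay once per step and never drop below the floor.
    return max(MIN_EPSILON, EPSILON - decrement * step)


halfway = epsilon_at(25_000)   # about 0.525
floor = epsilon_at(1_000_000)  # stays at the 0.05 floor
```

With these values the agent explores almost uniformly at the start and settles at a 5% random-action rate after 50000 steps.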

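    Table C1  Training parameters of MAAIC algorithm in MPP

    Parameter                  Value      Description
    Training step              3000000    Maximum number of training steps
    Learning rate              0.0005     Learning rate of the Adam optimizer
    Replay buffer size         600000     Maximum number of stored samples
    Mini-batch size_epsilon    32         Number of samples used for each parameter update
    Anneal_steps               500000     Number of annealing steps
    $\alpha$                   1          Extrinsic reward coefficient
    $\beta$                    0.5        Intrinsic reward coefficient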
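The coefficients $\alpha$ and $\beta$ in Table C1 weight the extrinsic and intrinsic rewards. A minimal sketch of how they might be combined is given below, assuming a simple weighted sum; the table lists only the coefficients, so the combination rule and the function name `total_reward` are assumptions.

```python
ALPHA = 1.0  # extrinsic reward coefficient (Table C1)
BETA = 0.5   # intrinsic reward coefficient (Table C1)


def total_reward(extrinsic, intrinsic, alpha=ALPHA, beta=BETA):
    """Weighted sum of environment reward and intrinsic intention reward (assumed form)."""
    return alpha * extrinsic + beta * intrinsic


# Hypothetical reward values, for illustration only.
example = total_reward(2.0, 1.0)
```

With $\beta = 0.5$ the intrinsic term contributes half as much per unit as the environment reward.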
Publication history
  • Received:  2021-05-18
  • Accepted:  2021-09-17
  • Published online:  2021-10-20
  • Issue date:  2023-11-22
