JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

Optimization methods in fully cooperative scenarios: a review of multiagent reinforcement learning

Author(s): Tao YANG, Xinhao SHI, Qinghan ZENG, Yulin YANG, Cheng XU, Hongzhe LIU
Affiliation(s): Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China
Corresponding email(s): 20231083510923@buu.edu.cn, 20221083510927@buu.edu.cn, xc-f4@163.com
Key Words: Multiagent reinforcement learning (MARL); Cooperative framework; Reward function; Cooperative objective optimization


Tao YANG, Xinhao SHI, Qinghan ZENG, Yulin YANG, Cheng XU, Hongzhe LIU. Optimization methods in fully cooperative scenarios: a review of multiagent reinforcement learning[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400259

@article{title="Optimization methods in fully cooperative scenarios: a review of multiagent reinforcement learning",
author="Tao YANG, Xinhao SHI, Qinghan ZENG, Yulin YANG, Cheng XU, Hongzhe LIU",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.2400259"
}

%0 Journal Article
%T Optimization methods in fully cooperative scenarios: a review of multiagent reinforcement learning
%A Tao YANG
%A Xinhao SHI
%A Qinghan ZENG
%A Yulin YANG
%A Cheng XU
%A Hongzhe LIU
%J Frontiers of Information Technology & Electronic Engineering
%P 479-509
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%U https://doi.org/10.1631/FITEE.2400259

TY - JOUR
T1 - Optimization methods in fully cooperative scenarios: a review of multiagent reinforcement learning
A1 - Tao YANG
A1 - Xinhao SHI
A1 - Qinghan ZENG
A1 - Yulin YANG
A1 - Cheng XU
A1 - Hongzhe LIU
JO - Frontiers of Information Technology & Electronic Engineering
SP - 479
EP - 509
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - https://doi.org/10.1631/FITEE.2400259
ER -


Abstract: Multiagent reinforcement learning (MARL) has become a prominent area of reinforcement learning in recent years, showing considerable potential across many application scenarios. The reward function guides agents to explore their environments and make optimal decisions by establishing evaluation criteria and feedback mechanisms. At the same time, cooperative objectives at the macro level give direction to agents’ learning, keeping individual behavioral strategies aligned with the overall system goals. The interplay between reward structures and cooperative objectives not only improves the effectiveness of individual agents but also fosters interagent collaboration, providing both momentum and direction for the development of swarm intelligence and the coordinated operation of multiagent systems. This review examines methods for designing reward structures and optimizing cooperative objectives in MARL, along with recent advances in the field. It also reviews the use of simulation environments in cooperative scenarios and discusses future trends and potential research directions, offering a forward-looking perspective for subsequent research.

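Chinese Summary

Optimization methods in fully cooperative scenarios: a review of multiagent reinforcement learning

Tao YANG^1,2, Xinhao SHI^1,2, Qinghan ZENG², Yulin YANG², Cheng XU¹, Hongzhe LIU¹
¹Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China
²Science and Technology Innovation Research Center, Unit 32178 of the Chinese People's Liberation Army, Beijing 100012, China
Abstract: In recent years, multiagent reinforcement learning (MARL) has become a rising star in the field of reinforcement learning, showing great potential in many application scenarios. The reward function guides agents to explore their environments and make optimal decisions by establishing evaluation criteria and feedback mechanisms. Meanwhile, macro-level cooperative objectives provide a trajectory for agents' learning, ensuring a high degree of consistency between individual behavioral strategies and the overall system goals. The interaction between reward structures and cooperative objectives not only enhances the effectiveness of individual agents but also promotes collaboration among agents, providing momentum and direction for the development of swarm intelligence and the harmonious operation of multiagent systems. This paper explores in depth the design methods for reward structures and the optimization strategies for cooperative objectives in MARL, and examines in detail the latest scientific progress in these areas. In addition, it reviews the application of simulation environments in cooperative scenarios and discusses future development trends and potential research directions, providing a forward-looking perspective and inspiration for subsequent research.

Key words: Multiagent reinforcement learning; Cooperative framework; Reward function; Cooperative objective optimization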


[50]Harutyunyan A, Devlin S, Vrancx P, et al., 2015. Expressing arbitrary reward functions as potential-based advice. Proc 29^th AAAI Conf on Artificial Intelligence, p.2652-2658.

[51]Hessel M, Modayil J, van Hasselt H, et al., 2018. Rainbow: combining improvements in deep reinforcement learning. Proc 32^nd AAAI Conf on Artificial Intelligence, p.3215-3222.

[52]Hou YK, Wei ZW, Liu SY, et al., 2023. Cross-regional task offloading with multi-agent reinforcement learning for hierarchical vehicular fog computing. Proc IEEE Symp on Computers and Communications, p.272-277.

[53]Hu JF, Sun YC, Chen HC, et al., 2022. Distributional reward estimation for effective multi-agent deep reinforcement learning. Proc 36^th Conf on Neural Information Processing Systems, p.12619-12632.

[54]Hua WY, Fan LZ, Li LY, et al., 2024. War and peace (WarAgent): LLM-based multi-agent simulation of world wars.

[55]Huang JB, Tan QL, Qi RJ, et al., 2024. RELight: a random ensemble reinforcement learning based method for traffic light control. Appl Intell, 54(1):95-112.

[56]Icarte RT, Klassen TQ, Valenzano R, et al., 2022. Reward machines: exploiting reward function structure in reinforcement learning. J Artif Intell Res, 73:173-208.

[57]Jaques N, Lazaridou A, Hughes E, et al., 2019. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. Proc 36^th Int Conf on Machine Learning, p.3040-3049.

[58]Jeon HC, Baek IC, Bae CM, et al., 2023. RaidEnv: exploring new challenges in automated content balancing for boss raid games. IEEE Trans Games, 16(3):645-658.

[59]Jeon J, Kim W, Jung W, et al., 2022. MASER: multi-agent reinforcement learning with subgoals generated from experience replay buffer. Proc 39^th Int Conf on Machine Learning, p.10041-10052.

[60]Ji JM, Qiu TY, Chen BY, et al., 2024. AI alignment: a comprehensive survey.

[61]Jia LY, Cai CT, Wang XM, et al., 2023. Multi-intent autonomous decision-making for air combat with deep reinforcement learning. Appl Intell, 53(23):29076-29093.

[62]Jiang H, Liu YT, Li SZ, et al., 2022. Diverse effective relationship exploration for cooperative multi-agent reinforcement learning. Proc 31^st ACM Int Conf on Information & Knowledge Management, p.842-851.

[63]Jiang JC, Lu ZQ, 2018. Learning attentional communication for multi-agent cooperation. Proc 32^nd Int Conf on Neural Information Processing Systems, p.7265-7275.

[64]Jo Y, Lee S, Yeom J, et al., 2024. FoX: formation-aware exploration in multi-agent reinforcement learning. Proc 38^th AAAI Conf on Artificial Intelligence, p.12985-12994.

[65]Khan MJ, Ahmed SH, Sukthankar G, 2022. Transformer-based value function decomposition for cooperative multi-agent reinforcement learning in StarCraft. Proc 18^th AAAI Conf on Artificial Intelligence and Interactive Digital Entertainment, p.113-119.

[66]Khan R, Khan N, Ahmad T, 2023. Communication in multi-agent reinforcement learning: a survey. Nucleus, 60(2):175-185.

[67]Kim SH, van Stralen N, Chowdhary G, et al., 2022. Disentangling successor features for coordination in multi-agent reinforcement learning. Proc 21^st Int Conf on Autonomous Agents and Multiagent Systems, p.751-760.

[68]Kong WR, Zhou DY, Du YJ, et al., 2023. Hierarchical multi-agent reinforcement learning for multi-aircraft close-range air combat. IET Contr Theory Appl, 17(13):1840-1862.

[69]Krajzewicz D, 2010. Traffic simulation with SUMO-simulation of urban mobility. In: Barceló J (Ed.), Fundamentals of Traffic Simulation. Springer, New York, p.269-293.

[70]Kuba JG, Wen MN, Meng LH, et al., 2021. Settling the variance of multi-agent policy gradients. Proc 35^th Conf on Neural Information Processing Systems, p.13458-13470.

[71]Kuba JG, Chen RQ, Wen MN, et al., 2022. Trust region policy optimisation in multi-agent reinforcement learning. Proc 10^th Int Conf on Learning Representations.

[72]Kurach K, Raichuk A, Stańczyk P, et al., 2020. Google Research Football: a novel reinforcement learning environment. Proc 34^th AAAI Conf on Artificial Intelligence, p.4501-4510.

[73]Lanctot M, Zambaldi V, Gruslys A, et al., 2017. A unified game-theoretic approach to multiagent reinforcement learning. Proc 31^st Int Conf on Neural Information Processing Systems, p.4193-4206.

[74]Laskin M, Wang LY, Oh J, et al., 2023. In-context reinforcement learning with algorithm distillation. Proc 11^th Int Conf on Learning Representations.

[75]Leroy P, Morato PG, Pisane J, et al., 2024. IMP-MARL: a suite of environments for large-scale infrastructure management planning via MARL. Proc 37^th Int Conf on Neural Information Processing Systems, Article 2329.

[76]Li CM, Liu J, Zhang YM, et al., 2023. ACE: cooperative multi-agent Q-learning with bidirectional action-dependency. Proc 37^th AAAI Conf on Artificial Intelligence, p.8536-8544.

[77]Li DP, Xu ZW, Zhang B, et al., 2024. From explicit communication to tacit cooperation: a novel paradigm for cooperative MARL. Proc 23^rd Int Conf on Autonomous Agents and Multiagent Systems, p.2360-2362.

[78]Li GH, Hammoud HAAK, Itani H, et al., 2023. CAMEL: communicative agents for “mind” exploration of large language model society.

[79]Li HP, He HB, 2024. Multiagent trust region policy optimization. IEEE Trans Neur Netw Learn Syst, 35(9):12873-12887.

[80]Li K, Gupta A, Reddy A, et al., 2021. MURAL: meta-learning uncertainty-aware rewards for outcome-driven reinforcement learning. Proc 38^th Int Conf on Machine Learning, p.6346-6356.

[81]Li QY, Peng ZH, Feng L, et al., 2023. MetaDrive: composing diverse driving scenarios for generalizable reinforcement learning. IEEE Trans Patt Anal Mach Intell, 45(3):3461-3475.

[82]Li S, Gupta JK, Morales P, et al., 2021. Deep implicit coordination graphs for multi-agent reinforcement learning. Proc 20^th Int Conf on Autonomous Agents and Multiagent Systems, p.764-772.

[83]Li W, Liu WY, Shao ST, et al., 2023. AIIR-MIX: multi-agent reinforcement learning meets attention individual intrinsic reward mixing network. Proc 14^th Asian Conf on Machine Learning, p.579-594.

[84]Li W, Liu WY, Shao ST, et al., 2024. Attention-based intrinsic reward mixing network for credit assignment in multiagent reinforcement learning. IEEE Trans Games, 16(2):270-281.

[85]Li WH, Wang XF, Jin B, et al., 2022. Dealing with non-stationarity in MARL via trust-region decomposition. Proc 10^th Int Conf on Learning Representations.

[86]Li Y, Zhang S, Sun JC, et al., 2023a. Cooperative open-ended learning framework for zero-shot coordination. Proc 40^th Int Conf on Machine Learning, p.20470-20484.

[87]Li Y, Xiong K, Zhang YP, et al., 2023b. JiangJun: mastering Xiangqi by tackling non-transitivity in two-player zero-sum games. https://arxiv.org/abs/2308.04719

[88]Li Y, Zhang S, Sun JC, et al., 2024. Tackling cooperative incompatibility for zero-shot human-AI coordination. J Artif Intell Res, 80:1139-1185.

[89]Li Z, Wang QC, Wang JB, et al., 2024. A flexible cooperative MARL method for efficient passage of an emergency CAV in mixed traffic. IEEE Trans Intell Transp Syst, 25(8):8898-8912.

[90]Liu BY, Pu ZQ, Pan Y, et al., 2023. Lazy agents: a new perspective on solving sparse reward problem in multi-agent reinforcement learning. Proc 40^th Int Conf on Machine Learning, p.21937-21950.

[91]Liu HY, Li ZH, Huang KH, et al., 2024. Evolutionary reinforcement learning algorithm for large-scale multi-agent cooperation and confrontation applications. J Supercomput, 80(2):2319-2346.

[92]Liu IJ, Jain U, Yeh RA, et al., 2021. Cooperative exploration for multi-agent deep reinforcement learning. Proc 38^th Int Conf on Machine Learning, p.6826-6836.

[93]Lopes M, Lang T, Toussaint M, et al., 2012. Exploration in model-based reinforcement learning by empirically estimating learning progress. Proc 25^th Int Conf on Neural Information Processing Systems, p.206-214.

[94]Lowe R, Wu Y, Tamar A, et al., 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Proc 31^st Int Conf on Neural Information Processing Systems, p.6382-6393.

[95]Ma YJ, Liang W, Wang GZ, et al., 2024. EUREKA: human-level reward design via coding large language models. Proc 12^th Int Conf on Learning Representations.

[96]Machado MC, Bellemare MG, Bowling M, 2020. Count-based exploration with the successor representation. Proc 34^th AAAI Conf on Artificial Intelligence, p.5125-5133.

[97]Mahajan A, Rashid T, Samvelyan M, et al., 2019. MAVEN: multi-agent variational exploration. Proc 33^rd Int Conf on Neural Information Processing Systems, Article 684.

[98]Mai V, Mani K, Paull L, 2022. Sample efficient deep reinforcement learning via uncertainty estimation. Proc 10^th Int Conf on Learning Representations.

[99]Mao HY, Wang C, Hao XT, et al., 2022. SEIHAI: a sample-efficient hierarchical AI for the MineRL competition. Proc 3^rd Int Conf on Distributed Artificial Intelligence, p.38-51.

[100]Mao HY, Zhao R, Chen H, et al., 2023. Transformer in Transformer as backbone for deep reinforcement learning.

[101]Medhi JK, Liu R, Wang QL, et al., 2023. Robust multiagent reinforcement learning for UAV systems: countering Byzantine attacks. Information, 14(11):623.

[102]Mguni DH, Jafferjee T, Wang JH, et al., 2022. LIGS: learnable intrinsic-reward generation selection for multi-agent learning. Proc 10^th Int Conf on Learning Representations.

[103]Miuccio L, Riolo S, Samarakoon S, et al., 2024. On learning generalized wireless MAC communication protocols via a feasible multi-agent reinforcement learning framework. IEEE Trans Mach Learn Commun Netw, 2:298-317.

[104]Mnih V, Kavukcuoglu K, Silver D, et al., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.

[105]Nekoei H, Badrinaaraayanan A, Sinha A, et al., 2023. Dealing with non-stationarity in decentralized cooperative multi-agent deep reinforcement learning via multi-timescale learning. Proc 2^nd Conf on Lifelong Learning Agents, p.376-398.

[106]Ng AY, Harada D, Russell S, 1999. Policy invariance under reward transformations: theory and application to reward shaping. Proc 16^th Int Conf on Machine Learning, p.278-287.

[107]Nguyen D, Nguyen P, Venkatesh S, et al., 2022. Learning to transfer role assignment across team sizes. Proc 21^st Int Conf on Autonomous Agents and Multiagent Systems, p.963-971.

[108]Nian XH, Li MM, Wang HB, et al., 2024. Large-scale UAV swarm confrontation based on hierarchical attention actor-critic algorithm. Appl Intell, 54(4):3279-3294.

[109]Oroojlooy A, Hajinezhad D, 2023. A review of cooperative multi-agent deep reinforcement learning. Appl Intell, 53(11):13677-13722.

[110]Ostrovski G, Bellemare MG, van den Oord A, et al., 2017. Count-based exploration with neural density models. Proc 34^th Int Conf on Machine Learning, p.2721-2730.

[111]Pan XH, Liu M, Zhong FW, et al., 2022. MATE: benchmarking multi-agent reinforcement learning in distributed target coverage control. Proc 36^th Conf on Neural Information Processing Systems, p.27862-27879.

[112]Papoudakis G, Christianos F, Rahman A, et al., 2019. Dealing with non-stationarity in multi-agent deep reinforcement learning.

[113]Papoudakis G, Christianos F, Schäfer L, et al., 2022. Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. Proc 35^th Conf on Neural Information Processing Systems.

[114]Park JS, O’Brien JC, Cai CJ, et al., 2023. Generative agents: interactive simulacra of human behavior. Proc 36^th Annual ACM Symp on User Interface Software and Technology, Article 2.

[115]Park S, Kim JP, Park C, et al., 2024. Quantum multi-agent reinforcement learning for autonomous mobility cooperation. IEEE Commun Mag, 62(6):106-112.

[116]Pathak D, Gandhi D, Gupta A, 2019. Self-supervised exploration via disagreement. Proc 36^th Int Conf on Machine Learning, p.5062-5071.

[117]Peng B, Rashid T, de Witt CS, et al., 2021. FACMAC: factored multi-agent centralised policy gradients. Proc 35^th Conf on Neural Information Processing Systems, p.12208-12221.

[118]Peng P, Wen Y, Yang YD, et al., 2017. Multiagent bidirectionally-coordinated nets: emergence of human-level coordination in learning to play StarCraft combat games.

[119]Perez-Liebana D, Hofmann K, Mohanty SP, et al., 2019. The multi-agent reinforcement learning in MalmÖ (MARLÖ) competition.

[120]Perolat J, de Vylder B, Hennes D, et al., 2022. Mastering the game of Stratego with model-free multiagent reinforcement learning. Science, 378(6623):990-996.

[121]Pesce E, Montana G, 2020. Improving coordination in small-scale multi-agent deep reinforcement learning through memory-driven communication. Mach Learn, 109(9-10):1727-1747.

[122]Pu ZQ, Wang HM, Liu Z, et al., 2023. Attention enhanced reinforcement learning for multi agent cooperation. IEEE Trans Neur Netw Learn Syst, 34(11):8235-8249.

[123]Qiao WC, Huang M, Gao ZM, et al., 2024. Distributed dynamic pricing of multiple perishable products using multi-agent reinforcement learning. Exp Syst Appl, 237:121252.

[124]Qu GN, Lin YH, Wierman A, et al., 2020. Scalable multi-agent reinforcement learning for networked systems with average reward. Proc 34^th Int Conf on Neural Information Processing Systems, Article 175.

[125]Qu Y, Wang BY, Shao JZ, et al., 2024. Hokoff: real game dataset from Honor of Kings and its offline reinforcement learning benchmarks. Proc 37^th Int Conf on Neural Information Processing Systems, Article 974.

[126]Rădulescu R, Mannion P, Roijers DM, et al., 2020. Multi-objective multi-agent decision making: a utility-based analysis and survey. Auton Agent Multi-Agent Syst, 34(1):10.

[127]Rashid T, Samvelyan M, de Witt CS, et al., 2018. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. Proc 35^th Int Conf on Machine Learning, p.4295-4304.

[128]Rashid T, Farquhar G, Peng B, et al., 2020. Weighted QMIX: expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Proc 34^th Int Conf on Neural Information Processing Systems, Article 855.

[129]Ratzlaff N, Bai QX, Li FX, et al., 2020. Implicit generative modeling for efficient exploration. Proc 37^th Int Conf on Machine Learning, Article 740.

[130]Ren FY, Dong W, Zhao XD, et al., 2024. Two-layer coordinated reinforcement learning for traffic signal control in traffic network. Exp Syst Appl, 235:121111.

[131]Ren Y, Zhang H, Du LK, et al., 2024. Stealthy black-box attack with dynamic threshold against MARL-based traffic signal control system. IEEE Trans Ind Inform, 20(10):12021-12031.

[132]Resnick C, Eldridge W, Ha D, et al., 2018. Pommerman: a multi-agent playground. Proc 14^th AAAI Conf on Artificial Intelligence and Interactive Digital Entertainment.

[133]Rodriguez J, Koutsopoulos HN, Wang SH, et al., 2023. Cooperative bus holding and stop-skipping: a deep reinforcement learning framework. Transp Res Part C Emerg Technol, 155:104308.

[134]Roostaie S, Ebadzadeh MM, 2021. EnTRPO: trust region policy optimization method with entropy regularization.

[135]Samvelyan M, Rashid T, de Witt CS, et al., 2019. The StarCraft Multi-Agent Challenge. Proc 18^th Int Conf on Autonomous Agents and Multiagent Systems, p.2186-2188.

[136]Schulman J, Levine S, Moritz P, et al., 2015. Trust region policy optimization. Proc 32^nd Int Conf on Machine Learning, p.1889-1897.

[137]Shao JZ, Lou ZQ, Zhang HC, et al., 2022. Self-organized group for cooperative multi-agent reinforcement learning. Proc 36^th Int Conf on Neural Information Processing Systems, Article 413.

[138]Sharma A, Gu SX, Levine S, et al., 2020. Dynamics-aware unsupervised discovery of skills. Proc 8^th Int Conf on Learning Representations.

[139]She J, Gupta JK, Kochenderfer MJ, 2022. Agent-time attention for sparse rewards multi-agent reinforcement learning. Proc 21^st Int Conf on Autonomous Agents and Multiagent Systems, p.1723-1725.

[140]Shen RM, Zheng Y, Hao JY, et al., 2020. Generating behavior-diverse game AIs with evolutionary multi-objective deep reinforcement learning. Proc 29^th Int Joint Conf on Artificial Intelligence, p.3371-3377.

[141]Shen SQ, Qiu MW, Liu J, et al., 2022. ResQ: a residual Q function-based approach for multi-agent reinforcement learning value factorization. Proc 36^th Int Conf on Neural Information Processing Systems, Article 395.

[142]Shou ZY, Di X, 2020. Reward design for driver repositioning using multi-agent reinforcement learning. Transp Res Part C Emerg Technol, 119:102738.

[143]Singh S, Jaakkola T, Littman ML, et al., 2000. Convergence results for single-step on-policy reinforcement-learning algorithms. Mach Learn, 38:287-308.

[144]Singh S, Barto AG, Chentanez N, 2004. Intrinsically motivated reinforcement learning. Proc 17^th Int Conf on Neural Information Processing Systems, p.1281-1288.

[145]Son K, Kim D, Kang WJ, et al., 2019. QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. Proc 36^th Int Conf on Machine Learning, p.5887-5896.

[146]Suay HB, Brys T, Taylor ME, et al., 2016. Learning from demonstration for shaping through inverse reinforcement learning. Proc Int Conf on Autonomous Agents and Multiagent Systems, p.429-437.

[147]Sukhbaatar S, Szlam A, Fergus R, 2016. Learning multi-agent communication with backpropagation. Proc 30^th Int Conf on Neural Information Processing Systems, p.2252-2260.

[148]Sunehag P, Lever G, Gruslys A, et al., 2017. Value-decomposition networks for cooperative multi-agent learning.

[149]Sutton RS, 1984. Temporal Credit Assignment in Reinforcement Learning. University of Massachusetts Amherst, Massachusetts, USA.

[150]Sutton RS, 1988. Learning to predict by the methods of temporal differences. Mach Learn, 3:9-44.

[151]Tang HR, Houthooft R, Foote D, et al., 2017. #Exploration: a study of count-based exploration for deep reinforcement learning. Proc 31^st Int Conf on Neural Information Processing Systems, p.2750-2759.

[152]Tian Q, Kuang K, Liu FR, et al., 2023. Learning from good trajectories in offline multi-agent reinforcement learning. Proc 37^th AAAI Conf on Artificial Intelligence, p.11672-11680.

[153]Vanneste S, Vanneste A, Mets K, et al., 2020. Learning to communicate using counterfactual reasoning.

[154]Wang BL, Gao XZ, Xie T, 2024. An evolutionary multi-agent reinforcement learning algorithm for multi-UAV air combat. Knowl-Based Syst, 299:112000.

[155]Wang ES, Liu F, Hong C, et al., 2024. MADRL-based UAV swarm non-cooperative game under incomplete information. Chin J Aeronaut, 37(6):293-306.

[156]Wang JH, Xu WK, Gu YJ, et al., 2021a. Multi-agent reinforcement learning for active voltage control on power distribution networks. Proc 35^th Int Conf on Neural Information Processing Systems, Article 250.

[157]Wang JH, Ren ZZ, Liu T, et al., 2021b. QPLEX: duplex dueling multi-agent Q-learning. Proc 9^th Int Conf on Learning Representations.

[158]Wang JH, Ren ZZ, Han BN, et al., 2021c. Towards understanding cooperative multi-agent Q-learning with value factorization. Proc 35^th Conf on Neural Information Processing Systems, p.29142-29155.

[159]Wang JR, Hong YT, Wang JL, et al., 2022. Cooperative and competitive multi-agent systems: from optimization to games. IEEE/CAA J Autom Sin, 9(5):763-783.

[160]Wang L, Zhang YP, Hu YJ, et al., 2022. Individual reward assisted multi-agent reinforcement learning. Proc 39^th Int Conf on Machine Learning, p.23417-23432.

[161]Wang SY, Chen WY, Hu J, et al., 2022. Noise-regularized advantage value for multi-agent reinforcement learning. Mathematics, 10(15):2728.

[162]Wang TH, Wang JH, Zheng CY, et al., 2020a. Learning nearly decomposable value functions via communication minimization. Proc 8^th Int Conf on Learning Representations.

[163]Wang TH, Dong H, Lesser V, et al., 2020b. ROMA: multi-agent reinforcement learning with emergent roles. Proc 37^th Int Conf on Machine Learning, p.9876-9886.

[164]Wang TH, Gupta T, Mahajan A, et al., 2021. RODE: learning roles to decompose multi-agent tasks. Proc 9^th Int Conf on Learning Representations.

[165]Wang TH, Zeng L, Dong WJ, et al., 2022. Context-aware sparse deep coordination graphs. Proc 10^th Int Conf on Learning Representations.

[166]Wang WX, Yang TP, Liu Y, et al., 2020. Action semantics network: considering the effects of actions in multiagent systems. Proc 8^th Int Conf on Learning Representations.

[167]Wang XH, Tian Z, Wan ZY, et al., 2023. Order matters: agent-by-agent policy optimization. Proc 11^th Int Conf on Learning Representations.

[168]Wang YT, Sartoretti G, 2022. FCMNet: full communication memory net for team-level cooperation in multi-agent systems. Proc 21^st Int Conf on Autonomous Agents and Multiagent Systems, p.1355-1363.

[169]Wang YX, Zeng ZX, Zhao QJ, 2023. Evaluating the perceived safety of urban city via maximum entropy deep inverse reinforcement learning. Proc 14^th Asian Conf on Machine Learning, p.1085-1100.

[170]Wang ZH, Cai SF, Chen GZ, et al., 2024. Describe, explain, plan and select: interactive planning with large language models enables open-world multi-task agents.

[171]Wen MN, Kuba J, Lin RJ, et al., 2022. Multi-agent reinforcement learning is a sequence modeling problem. Proc 36^th Int Conf on Neural Information Processing Systems, Article 1201.

[172]White CC III, White DJ, 1989. Markov decision processes. Eur J Oper Res, 39(1):1-16.

[173]Wiewiora E, 2003. Potential-based shaping and Q-value initialization are equivalent. J Artif Intell Res, 19:205-208.

[174]Wiewiora E, Cottrell G, Elkan C, 2003. Principled methods for advising reinforcement learning agents. Proc 20^th Int Conf on Machine Learning, p.792-799.

[175]Wu T, Zhou P, Liu K, et al., 2020. Multi-agent deep reinforcement learning for urban traffic light control in vehicular networks. IEEE Trans Veh Technol, 69(8):8243-8256.

[176]Wu ZF, Yu C, Ye DC, et al., 2021. Coordinated proximal policy optimization. Proc 35^th Conf on Neural Information Processing Systems, p.26437-26448.

[177]Xiao BC, Ramasubramanian B, Poovendran R, 2022. Agent-temporal attention for reward redistribution in episodic multi-agent reinforcement learning. Proc 21^st Int Conf on Autonomous Agents and Multiagent Systems, p.1391-1399.

[178]Xiao BD, Li RP, Wang F, et al., 2024. Stochastic graph neural network-based value decomposition for MARL in Internet of Vehicles. IEEE Trans Veh Technol, 73(2):1582-1596.

[179]Xiao J, Yuan GH, He JH, et al., 2023. Graph attention mechanism based reinforcement learning for multi-agent flocking control in communication-restricted environment. Inform Sci, 620:142-157.

[180]Xu P, Zhang JG, Yin QY, et al., 2023. Subspace-aware exploration for sparse-reward multi-agent tasks. Proc 37^th AAAI Conf on Artificial Intelligence, p.11717-11725.

[181]Xu X, Jia Y, Xu Y, et al., 2020. A multi-agent reinforcement learning-based data-driven method for home energy management. IEEE Trans Smart Grid, 11(4):3201-3211.

[182]Xu YZ, Wang S, Li P, et al., 2024. Exploring large language models for communication games: an empirical study on Werewolf.

[183]Xu ZW, Zhang B, Li DP, et al., 2023. Dual self-awareness value decomposition framework without individual global max for cooperative MARL. Proc 37^th Conf on Neural Information Processing Systems, p.73898-73918.

[184]Yang NK, Han LJ, Liu R, et al., 2023. Multiobjective intelligent energy management for hybrid electric vehicles based on multiagent reinforcement learning. IEEE Trans Transp Electrif, 9(3):4294-4305.

[185]Yang TP, Wang WX, Tang HY, et al., 2021. An efficient transfer learning framework for multiagent reinforcement learning. Proc 35^th Int Conf on Neural Information Processing Systems, Article 1302.

[186]Yang YD, Wen Y, Chen LH, et al., 2020a. Multi-agent determinantal Q-learning. Proc 37^th Int Conf on Machine Learning, Article 997.

[187]Yang YD, Hao JY, Liao B, et al., 2020b. Qatten: a general framework for cooperative multiagent reinforcement learning.

[188]Yang YD, Hao JY, Chen GY, et al., 2020c. Q-value path decomposition for deep multiagent reinforcement learning. Proc 37^th Int Conf on Machine Learning, Article 992.

[189]Yang Z, Moerland TM, Preuss M, et al., 2022. When to go, and when to explore: the benefit of post-exploration in intrinsic motivation.

[190]Ye JN, Li CH, Wang JH, et al., 2023. Towards global optimality in cooperative MARL with the transformation and distillation framework.

[191]Yeh RA, Schwing AG, Huang J, et al., 2019. Diverse generation for multi-agent sports games. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.4605-4614.

[192]Yi YX, Li G, Wang YW, et al., 2022. Learning to share in multi-agent reinforcement learning. Proc 36^th Int Conf on Neural Information Processing Systems, Article 1100.

[193]Yu C, Velu A, Vinitsky E, et al., 2022. The surprising effectiveness of PPO in cooperative multi-agent games. Proc 36^th Int Conf on Neural Information Processing Systems, Article 1787.

[194]Yuan TT, Chung HM, Yuan J, et al., 2023. DACOM: learning delay-aware communication for multi-agent reinforcement learning. Proc 37^th AAAI Conf on Artificial Intelligence, p.11763-11771.

[195]Yuan WL, Chen JX, Chen SF, et al., 2024. Transformer in reinforcement learning for decision-making: a survey. Front Inform Technol Electron Eng, 25(6):763-790.

[196]Zang YF, He JM, Li K, et al., 2023. Automatic grouping for efficient cooperative multi-agent reinforcement learning. Proc 37^th Conf on Neural Information Processing Systems, p.46105-46121.

[197]Zeng SL, Chen TY, Garcia A, et al., 2022. Learning to coordinate in multi-agent systems: a coordinated actor-critic algorithm and finite-time guarantees. Proc 4^th Annual Learning for Dynamics and Control Conf, p.278-290.

[198]Zha DC, Xie JR, Ma WY, et al., 2021. DouZero: mastering DouDizhu with self-play deep reinforcement learning. Proc 38^th Int Conf on Machine Learning, p.12333-12344.

[199]Zhang HC, Li GZ, Liu CH, et al., 2023. HiMacMic: hierarchical multi-agent deep reinforcement learning with dynamic asynchronous macro strategy. Proc 29^th ACM SIGKDD Conf on Knowledge Discovery and Data Mining, p.3239-3248.

[200]Zhang KQ, Yang ZR, Başar T, 2021. Decentralized multi-agent reinforcement learning with networked agents: recent advances. Front Inform Technol Electron Eng, 22(6):802-814.

[201]Zhang M, Zhang SH, Yang ZJ, et al., 2023. GoBigger: a scalable platform for cooperative–competitive multi-agent interactive simulation. Proc 11^th Int Conf on Learning Representations.

[202]Zhang NM, Shen YL, Du Y, et al., 2023. Counterfactual-attention multi-agent reinforcement learning for joint condition-based maintenance and production scheduling. J Manuf Syst, 71:70-81.

[203]Zhang TJ, Xu HZ, Wang XL, et al., 2020. Multi-agent collaboration via reward attribution decomposition.

[204]Zhang Y, Yang QY, An D, et al., 2021. Coordination between individual agents in multi-agent reinforcement learning. Proc 35^th AAAI Conf on Artificial Intelligence, p.11387-11394.

[205]Zhang ZQ, Yuan L, Li LH, et al., 2023. Fast teammate adaptation in the presence of sudden policy change. Proc 39^th Conf on Uncertainty in Artificial Intelligence, p.2465-2476.

[206]Zhao J, Zhao YP, Wang WX, et al., 2022. Coach-assisted multi-agent reinforcement learning framework for unexpected crashed agents. Front Inform Technol Electron Eng, 23(7):1032-1042.

[207]Zhao LY, Chang TQ, Chu KX, et al., 2023. Survey of fully cooperative multi-agent deep reinforcement learning. Comput Eng Appl, 59(12):14-27 (in Chinese).

[208]Zhao XY, Holden SB, 2022. Towards a competitive 3-player Mahjong AI using deep reinforcement learning. Proc IEEE Conf on Games, p.524-527.

[209]Zheng LL, Chen JR, Wang JH, et al., 2021. Episodic multi-agent reinforcement learning with curiosity-driven exploration. Proc 35^th Int Conf on Neural Information Processing Systems, Article 287.

[210]Zheng LM, Yang JC, Cai H, et al., 2018. MAgent: a many-agent reinforcement learning platform for artificial collective intelligence. Proc 32^nd AAAI Conf on Artificial Intelligence, p.8222-8223.

[211]Zheng Y, Xie XF, Su T, et al., 2019. Wuji: automatic online combat game testing using evolutionary deep reinforcement learning. Proc 34^th IEEE/ACM Int Conf on Automated Software Engineering, p.772-784.

[212]Zhou YM, Yang F, Zhang CY, et al., 2024. Cooperative decision-making algorithm with efficient convergence for UCAV formation in beyond-visual-range air combat based on multi-agent reinforcement learning. Chin J Aeronaut, 37(8):311-328.

[213]Zhu XZ, Chen YT, Tian H, et al., 2023. Ghost in the Minecraft: generally capable agents for open-world environments via large language models with text-based knowledge and memory.

[214]Zhuang ZF, Lei K, Liu JX, et al., 2023. Behavior proximal policy optimization. Proc 11^th Int Conf on Learning Representations.

[215]Zohar R, Mannor S, Tennenholtz G, 2022. Locality matters: a scalable value decomposition approach for cooperative multi-agent reinforcement learning. Proc 36^th AAAI Conf on Artificial Intelligence, p.9278-9285.

[216]Zou HS, Ren TZ, Yan D, et al., 2019. Reward shaping via meta-learning.
