CLC number: TP18
On-line Access: 2024-08-27
Received: 2019-10-17
Revision Accepted: 2020-05-08
Crosschecked: 2021-04-29
Kaiqing Zhang, Zhuoran Yang, Tamer Başar. Decentralized multi-agent reinforcement learning with networked agents: recent advances[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(6): 802-814.
@article{zhang2021decentralized,
title="Decentralized multi-agent reinforcement learning with networked agents: recent advances",
author="Kaiqing Zhang, Zhuoran Yang, Tamer Başar",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="22",
number="6",
pages="802-814",
year="2021",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1900661"
}
%0 Journal Article
%T Decentralized multi-agent reinforcement learning with networked agents: recent advances
%A Kaiqing Zhang
%A Zhuoran Yang
%A Tamer Başar
%J Frontiers of Information Technology & Electronic Engineering
%V 22
%N 6
%P 802-814
%@ 2095-9184
%D 2021
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1900661
TY - JOUR
T1 - Decentralized multi-agent reinforcement learning with networked agents: recent advances
A1 - Kaiqing Zhang
A1 - Zhuoran Yang
A1 - Tamer Başar
JO - Frontiers of Information Technology & Electronic Engineering
VL - 22
IS - 6
SP - 802
EP - 814
SN - 2095-9184
Y1 - 2021
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1900661
ER -
Abstract: Multi-agent reinforcement learning (MARL) has long been a significant research topic in both machine learning and control systems. Recent developments in (single-agent) deep reinforcement learning have created a resurgence of interest in developing new MARL algorithms, especially those founded on theoretical analysis. In this paper, we review recent advances in a sub-area of this topic: decentralized MARL with networked agents. In this scenario, multiple agents perform sequential decision-making in a common environment without coordination by any central controller, while being allowed to exchange information with their neighbors over a communication network. Such a setting finds broad applications in the control and operation of robots, unmanned vehicles, mobile sensor networks, and the smart grid. This review covers several of our research endeavors in this direction, as well as progress made by other researchers along this line. We hope that this review promotes additional research efforts in this exciting yet challenging area.
[1]Adler JL, Blue VJ, 2002. A cooperative multi-agent transportation management and route guidance system. Transp Res Part C Emerg Technol, 10(5-6):433-454.
[2]Agarwal A, Duchi JC, 2011. Distributed delayed stochastic optimization. Proc 24th Int Conf on Neural Information Processing Systems, p.873-881.
[3]Antos A, Szepesvári C, Munos R, 2008a. Fitted Q-iteration in continuous action-space MDPs. Advances in Neural Information Processing Systems, p.9-16.
[4]Antos A, Szepesvári C, Munos R, 2008b. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Mach Learn, 71(1):89-129.
[5]Assran M, Romoff J, Ballas N, et al., 2019. Gossip-based actor-learner architectures for deep reinforcement learning. Advances in Neural Information Processing Systems, p.13299-13309.
[6]Başar T, Olsder GJ, 1999. Dynamic Noncooperative Game Theory. SIAM, Philadelphia.
[7]Baxter J, Bartlett PL, 2001. Infinite-horizon policy-gradient estimation. J Artif Intell Res, 15:319-350.
[8]Bertsekas D, 2019. Multiagent rollout algorithms and reinforcement learning. https://arxiv.org/abs/1910.00120
[9]Bertsekas DP, 2005. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, USA.
[10]Bhandari J, Russo D, Singal R, 2018. A finite time analysis of temporal difference learning with linear function approximation. Proc 31st Conf on Learning Theory, p.1691-1692.
[11]Bhatnagar S, Sutton RS, Ghavamzadeh M, et al., 2009. Natural actor-critic algorithms. Automatica, 45(11):2471-2482.
[12]Borkar VS, 2008. Stochastic Approximation: a Dynamical Systems Viewpoint. Cambridge University Press, Cambridge, UK.
[13]Boutilier C, 1996. Planning, learning and coordination in multiagent decision processes. Proc 6th Conf on Theoretical Aspects of Rationality and Knowledge, p.195-210.
[14]Boyd S, Parikh N, Chu E, et al., 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends® Mach Learn, 3(1):1-122.
[15]Busoniu L, Babuska R, de Schutter B, et al., 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern Part C Appl Rev, 38(2):156-172.
[16]Cassano L, Yuan K, Sayed AH, 2018. Multi-agent fully decentralized value function learning with linear convergence rates. https://arxiv.org/abs/1810.07792
[17]Cassano L, Alghunaim SA, Sayed AH, 2019. Team policy learning for multi-agent reinforcement learning. IEEE Int Conf on Acoustics, Speech and Signal Processing, p.3062-3066.
[18]Chen TY, Zhang KQ, Giannakis GB, et al., 2018. Communication-efficient distributed reinforcement learning. https://arxiv.org/abs/1812.03239
[19]Ciosek K, Whiteson S, 2018. Expected policy gradients for reinforcement learning. https://arxiv.org/abs/1801.03326
[20]Corke P, Peterson R, Rus D, 2005. Networked robots: flying robot navigation using a sensor net. In: Dario P, Chatila R (Eds.), Robotics Research. Springer, Berlin, p.234-243.
[21]Dall’Anese E, Zhu H, Giannakis GB, 2013. Distributed optimal power flow for smart microgrids. IEEE Trans Smart Grid, 4(3):1464-1475.
[22]Ding DS, Wei XH, Yang ZR, et al., 2019. Fast multi-agent temporal-difference learning via homotopy stochastic primal-dual optimization. https://arxiv.org/abs/1908.02805
[23]Doan TT, Maguluri S, Romberg J, 2019a. Finite-time analysis of distributed TD(0) with linear function approximation for multi-agent reinforcement learning. Proc 36th Int Conf on Machine Learning, p.1626-1635.
[24]Doan TT, Maguluri ST, Romberg J, 2019b. Finite-time performance of distributed temporal difference learning with linear function approximation. https://arxiv.org/abs/1907.12530
[25]Fan JQ, Tong X, Zeng Y, 2015. Multi-agent inference in social networks: a finite population learning approach. J Am Stat Assoc, 110(509):149-158.
[26]Farahmand AM, Munos R, Szepesvári C, 2010. Error propagation for approximate policy and value iteration. Advances in Neural Information Processing Systems, p.568-576.
[27]Foerster JN, Assael YM, de Freitas N, et al., 2016. Learning to communicate with deep multi-agent reinforcement learning. Proc 30th Int Conf on Neural Information Processing Systems, p.2137-2145.
[28]Gupta JK, Egorov M, Kochenderfer M, 2017. Cooperative multi-agent control using deep reinforcement learning. Int Conf on Autonomous Agents and Multiagent Systems, p.66-83.
[29]Hong MY, Chang TH, 2017. Stochastic proximal gradient consensus over random networks. IEEE Trans Signal Process, 65(11):2933-2948.
[30]Jakovetic D, Xavier J, Moura JMF, 2011. Cooperative convex optimization in networked systems: augmented Lagrangian algorithms with directed gossip communication. IEEE Trans Signal Process, 59(8):3889-3902.
[31]Kar S, Moura JMF, 2013. Consensus + innovations distributed inference over networks: cooperation and sensing in networked systems. IEEE Signal Process Mag, 30(3):99-109.
[32]Kar S, Moura JMF, Poor HV, 2013. QD-learning: a collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Trans Signal Process, 61(7):1848-1862.
[33]Kober J, Bagnell JA, Peters J, 2013. Reinforcement learning in robotics: a survey. Int J Rob Res, 32(11):1238-1274.
[34]Konda VR, Tsitsiklis JN, 1999. Actor-critic algorithms. Advances in Neural Information Processing Systems, p.1008-1014.
[35]Lange S, Gabel T, Riedmiller M, 2012. Batch reinforcement learning. In: Wiering M, van Otterlo M (Eds.), Reinforcement Learning. Adaptation, Learning, and Optimization. Springer, Berlin, Heidelberg.
[36]Lauer M, Riedmiller MA, 2000. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. Proc 17th Int Conf on Machine Learning, p.535-542.
[37]Lee D, Yoon H, Hovakimyan N, 2018. Primal-dual algorithm for distributed reinforcement learning: distributed GTD. IEEE Conf on Decision and Control, p.1967-1972.
[38]Lillicrap TP, Hunt JJ, Pritzel A, et al., 2016. Continuous control with deep reinforcement learning. Proc 4th Int Conf on Learning Representations.
[39]Lin YX, Zhang KQ, Yang ZR, et al., 2019. A communication-efficient multi-agent actor-critic algorithm for distributed reinforcement learning. Proc IEEE 58th Conf on Decision and Control, p.5562-5567.
[40]Littman ML, 1994. Markov games as a framework for multi-agent reinforcement learning. Proc 11th Int Conf on Machine Learning, p.157-163.
[41]Liu B, Liu J, Ghavamzadeh M, et al., 2015. Finite-sample analysis of proximal gradient TD algorithms. Proc 31st Conf on Uncertainty in Artificial Intelligence, p.504-513.
[42]Lowe R, Wu Y, Tamar A, et al., 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Proc 31st Int Conf on Neural Information Processing Systems, p.6379-6390.
[43]Macua SV, Chen JS, Zazo S, et al., 2015. Distributed policy evaluation under multiple behavior strategies. IEEE Trans Autom Contr, 60(5):1260-1274.
[44]Macua SV, Tukiainen A, Hernández DGO, et al., 2017. Diff-DAC: distributed actor-critic for average multitask deep reinforcement learning. https://arxiv.org/abs/1710.10363
[45]Mahajan A, Teneketzis D, 2008. Sequential Decomposition of Sequential Dynamic Teams: Applications to Real-Time Communication and Networked Control Systems. University of Michigan, Ann Arbor, USA.
[46]Maei HR, Szepesvári C, Bhatnagar S, et al., 2009. Convergent temporal-difference learning with arbitrary smooth function approximation. Proc 22nd Int Conf on Neural Information Processing Systems, p.1204-1212.
[47]Mnih V, Kavukcuoglu K, Silver D, et al., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.
[48]Munos R, 2007. Performance bounds in Lp-norm for approximate value iteration. SIAM J Contr Optim, 46(2):541-561.
[49]Munos R, Szepesvári C, 2008. Finite-time bounds for fitted value iteration. J Mach Learn Res, 9:815-857.
[50]Nedić A, Ozdaglar A, 2009. Distributed subgradient methods for multi-agent optimization. IEEE Trans Autom Contr, 54(1):48-61.
[51]Nedić A, Olshevsky A, Shi W, 2017. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J Optim, 27(4):2597-2633.
[52]Oliehoek FA, Amato C, 2016. A Concise Introduction to Decentralized POMDPs. Springer, Cham.
[53]Omidshafiei S, Pazis J, Amato C, et al., 2017. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. Proc 34th Int Conf on Machine Learning, p.2681-2690.
[54]Pennesi P, Paschalidis IC, 2010. A distributed actor-critic algorithm and applications to mobile sensor network coordination problems. IEEE Trans Autom Contr, 55(2):492-497.
[55]Qie H, Shi DX, Shen TL, et al., 2019. Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access, 7:146264-146272.
[56]Qu GN, Li N, 2018. Harnessing smoothness to accelerate distributed optimization. IEEE Trans Contr Netw Syst, 5(3):1245-1260.
[57]Rabbat M, Nowak R, 2004. Distributed optimization in sensor networks. Proc 3rd Int Symp on Information Processing in Sensor Networks, p.20-27.
[58]Ren J, Haupt J, 2019. A communication efficient hierarchical distributed optimization algorithm for multi-agent reinforcement learning. Real-World Sequential Decision Making Workshop at Int Conf on Machine Learning.
[59]Riedmiller M, 2005. Neural fitted Q iteration—first experiences with a data efficient neural reinforcement learning method. Proc 16th European Conf on Machine Learning, p.317-328.
[60]Sayed AH, 2014. Adaptation, learning, and optimization over networks. Found Trends® Mach Learn, 7(4-5):311-801.
[61]Schmidt M, Le Roux N, Bach F, 2017. Minimizing finite sums with the stochastic average gradient. Math Program, 162(1-2):83-112.
[62]Sha XY, Zhang JQ, Zhang KQ, et al., 2020. Asynchronous policy evaluation in distributed reinforcement learning over networks. https://arxiv.org/abs/2003.00433
[63]Shalev-Shwartz S, Shammah S, Shashua A, 2016. Safe, multi-agent, reinforcement learning for autonomous driving. https://arxiv.org/abs/1610.03295
[64]Shapley LS, 1953. Stochastic games. PNAS, 39(10):1095-1100.
[65]Shi W, Ling Q, Wu G, et al., 2015. EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J Optim, 25(2):944-966.
[66]Silver D, Lever G, Heess N, et al., 2014. Deterministic policy gradient algorithms. Proc 31st Int Conf on Machine Learning, p.387-395.
[67]Silver D, Huang A, Maddison CJ, et al., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489.
[68]Silver D, Schrittwieser J, Simonyan K, et al., 2017. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359.
[69]Singh S, Jaakkola T, Littman ML, et al., 2000. Convergence results for single-step on-policy reinforcement-learning algorithms. Mach Learn, 38(3):287-308.
[70]Singh SP, Sutton RS, 1996. Reinforcement learning with replacing eligibility traces. Mach Learn, 22(1-3):123-158.
[71]Srikant R, Ying L, 2019. Finite-time error bounds for linear stochastic approximation and TD learning. Proc 32nd Conf on Learning Theory, p.2803-2830.
[72]Stanković MS, Stanković SS, 2016. Multi-agent temporal-difference learning with linear function approximation: weak convergence under time-varying network topologies. American Control Conf, p.167-172.
[73]Stanković MS, Ilić N, Stanković SS, 2016. Distributed stochastic approximation: weak convergence and network design. IEEE Trans Autom Contr, 61(12):4069-4074.
[74]Suttle W, Yang ZR, Zhang KQ, et al., 2019. A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning. https://arxiv.org/abs/1903.06372
[75]Sutton RS, McAllester DA, Singh SP, et al., 2000. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, p.1057-1063.
[76]Sutton RS, Szepesvári C, Maei HR, 2008. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Proc 21st Int Conf on Neural Information Processing Systems, p.1609-1616.
[77]Sutton RS, Maei HR, Precup D, et al., 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. Proc 26th Annual Int Conf on Machine Learning, p.993-1000.
[78]Sutton RS, Mahmood AR, White M, 2016. An emphatic approach to the problem of off-policy temporal-difference learning. J Mach Learn Res, 17(1):2603-2631.
[79]Tesauro G, 1995. Temporal difference learning and TD-Gammon. Commun ACM, 38(3):58-68.
[80]Tsitsiklis JN, van Roy B, 1997. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems, p.1075-1081.
[81]Tu SY, Sayed AH, 2012. Diffusion strategies outperform consensus strategies for distributed estimation over adaptive networks. IEEE Trans Signal Process, 60(12):6217-6234.
[82]Varshavskaya P, Kaelbling LP, Rus D, 2009. Efficient distributed reinforcement learning through agreement. In: Asama H, Kurokawa H, Ota J, et al. (Eds.), Distributed Autonomous Robotic Systems. Springer, Berlin, p.367-378.
[83]Wai HT, Yang Z, Wang ZR, et al., 2018. Multi-agent reinforcement learning via double averaging primal-dual optimization. Advances in Neural Information Processing Systems, p.9649-9660.
[84]Wang XF, Sandholm T, 2003. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. Proc 15th Int Conf on Neural Information Processing Systems, p.1603-1610.
[85]Watkins CJCH, Dayan P, 1992. Q-learning. Mach Learn, 8(3-4):279-292.
[86]Williams RJ, 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn, 8(3-4):229-256.
[87]Xiao L, Boyd S, Kim SJ, 2007. Distributed average consensus with least-mean-square deviation. J Parall Distrib Comput, 67(1):33-46.
[88]Ying BC, Yuan K, Sayed AH, 2018. Convergence of variance-reduced learning under random reshuffling. IEEE Int Conf on Acoustics, Speech and Signal Processing, p.2286-2290.
[89]Yu HZ, 2015. On convergence of emphatic temporal-difference learning. Proc 28th Conf on Learning Theory, p.1724-1751.
[90]Zazo S, Macua SV, Sánchez-Fernández M, et al., 2016. Dynamic potential games with constraints: fundamentals and applications in communications. IEEE Trans Signal Process, 64(14):3806-3821.
[91]Zhang HG, Jiang H, Luo YH, et al., 2017. Data-driven optimal consensus control for discrete-time multi-agent systems with unknown dynamics using reinforcement learning method. IEEE Trans Ind Electron, 64(5):4091-4100.
[92]Zhang KQ, Lu LQ, Lei C, et al., 2018a. Dynamic operations and pricing of electric unmanned aerial vehicle systems and power networks. Transp Res Part C Emerg Technol, 92:472-485.
[93]Zhang KQ, Yang ZR, Liu H, et al., 2018b. Finite-sample analyses for fully decentralized multi-agent reinforcement learning. https://arxiv.org/abs/1812.02783v5
[94]Zhang KQ, Yang ZR, Liu H, et al., 2018c. Fully decentralized multi-agent reinforcement learning with networked agents. Proc 35th Int Conf on Machine Learning, p.5867-5876.
[95]Zhang KQ, Yang ZR, Başar T, 2018d. Networked multi-agent reinforcement learning in continuous spaces. IEEE Conf on Decision and Control, p.2771-2776.
[96]Zhang KQ, Yang ZR, Başar T, 2019. Multi-agent reinforcement learning: a selective overview of theories and algorithms. https://arxiv.org/abs/1911.10635
[97]Zhang QC, Zhao DB, Lewis FL, 2018. Model-free reinforcement learning for fully cooperative multi-agent graphical games. Int Joint Conf on Neural Networks, p.1-6.
[98]Zhang Y, Zavlanos MM, 2019. Distributed off-policy actor-critic reinforcement learning with policy consensus. https://arxiv.org/abs/1903.09255