CLC number: TP242
On-line Access: 2019-05-14
Received: 2018-09-15
Revision Accepted: 2018-11-27
Crosschecked: 2019-04-28
Li-dong Zhang, Ban Wang, Zhi-xiang Liu, You-min Zhang, Jian-liang Ai. Motion planning of a quadrotor robot game using a simulation-based projected policy iteration method[J]. Frontiers of Information Technology & Electronic Engineering, 2019, 20(4): 525-537.
@article{Zhang2019quadrotor,
title="Motion planning of a quadrotor robot game using a simulation-based projected policy iteration method",
author="Li-dong Zhang, Ban Wang, Zhi-xiang Liu, You-min Zhang, Jian-liang Ai",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="20",
number="4",
pages="525-537",
year="2019",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1800571"
}
Abstract: Making rational decisions in sequential decision problems in complex environments has challenged researchers in various fields for decades. Such problems involve state transition dynamics, stochastic uncertainties, long-term utilities, and other factors that raise high barriers, including the curse of dimensionality. Recently developed state-of-the-art reinforcement learning algorithms offer strong potential to break these barriers efficiently and to handle complex, practical decision problems with good performance. We propose a formulation of a velocity-varying one-on-one quadrotor robot game in three-dimensional space, together with an approximate dynamic programming approach that uses a projected policy iteration method to learn the utilities of game states and improve motion policies. In addition, a simulation-based iterative scheme is employed to overcome the curse of dimensionality. Simulation results demonstrate that the proposed decision strategy can generate effective and efficient motion policies that contend with the opponent quadrotor and gain an advantageous status during the game. Flight experiments conducted in the Networked Autonomous Vehicles (NAV) Lab at Concordia University further validate the performance of the proposed decision strategy in a real-time environment.
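For readers unfamiliar with the method family named in the abstract, the following is a minimal, hypothetical sketch of simulation-based projected policy iteration: policy evaluation by least-squares projection of simulated Monte Carlo returns onto a linear feature basis, followed by greedy policy improvement. It runs on a small synthetic Markov decision process; the MDP, feature map, and all parameters are illustrative assumptions and do not reproduce the quadrotor game formulation or the authors' implementation.

import numpy as np

# Minimal sketch of simulation-based projected policy iteration on a small
# synthetic MDP (illustrative stand-in for the quadrotor game, not the
# paper's formulation).
rng = np.random.default_rng(0)
n_states, n_actions, n_features, gamma = 20, 4, 6, 0.95

# Random transition kernel P[s, a, s'] and reward R[s, a]; a real
# implementation would query a flight/game simulator instead.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
Phi = rng.standard_normal((n_states, n_features))  # feature matrix, rows phi(s)

def evaluate_policy(policy, n_trajectories=200, horizon=60):
    """Projected policy evaluation: fit w so that Phi @ w approximates the
    policy's value by least-squares regression of simulated discounted
    returns on the start-state features."""
    X, y = [], []
    for _ in range(n_trajectories):
        s0 = s = rng.integers(n_states)
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]
            ret += discount * R[s, a]
            discount *= gamma
            s = rng.choice(n_states, p=P[s, a])
        X.append(Phi[s0])
        y.append(ret)
    w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return w

def improve_policy(w):
    """Greedy one-step lookahead against the approximate value Phi @ w."""
    q = R + gamma * (P @ (Phi @ w))  # q[s, a]
    return q.argmax(axis=1)

policy = rng.integers(n_actions, size=n_states)  # arbitrary initial policy
for _ in range(10):  # approximate policy iteration loop
    w = evaluate_policy(policy)
    new_policy = improve_policy(w)
    if np.array_equal(new_policy, policy):  # policy has stabilized
        break
    policy = new_policy
print("approximate state values:", np.round(Phi @ w, 2))

In the paper's setting, the features would instead encode the relative positions and velocities of the two quadrotors and the transition model would be the game dynamics; that substitution changes the evaluation and lookahead steps but not the overall iteration structure shown here.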