Journal of Zhejiang University

Frontiers of Information Technology & Electronic Engineering 2020 Vol.21 No.5 P.777-795

http://doi.org/10.1631/FITEE.1900641

Proximal policy optimization with an integral compensator for quadrotor control

Author(s): Huan Hu, Qing-ling Wang
Affiliation(s): 1. School of Automation, Southeast University, Nanjing 210096, China
Corresponding email(s): qlwang@seu.edu.cn
Key Words: Reinforcement learning, Proximal policy optimization, Quadrotor control, Neural network


Huan Hu, Qing-ling Wang. Proximal policy optimization with an integral compensator for quadrotor control[J]. Frontiers of Information Technology & Electronic Engineering, 2020, 21(5): 777-795.

@article{title="Proximal policy optimization with an integral compensator for quadrotor control",
author="Huan Hu, Qing-ling Wang",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="21",
number="5",
pages="777-795",
year="2020",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1900641"
}

%0 Journal Article
%T Proximal policy optimization with an integral compensator for quadrotor control
%A Huan Hu
%A Qing-ling Wang
%J Frontiers of Information Technology & Electronic Engineering
%V 21
%N 5
%P 777-795
%@ 2095-9184
%D 2020
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1900641

TY - JOUR
T1 - Proximal policy optimization with an integral compensator for quadrotor control
A1 - Huan Hu
A1 - Qing-ling Wang
JO - Frontiers of Information Technology & Electronic Engineering
VL - 21
IS - 5
SP - 777
EP - 795
SN - 2095-9184
Y1 - 2020
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1900641
ER -


Abstract: We use the proximal policy optimization (PPO) reinforcement learning algorithm to optimize a stochastic control policy that achieves speed control of a "model-free" quadrotor. The quadrotor is controlled by four learned neural networks, which map the system states directly to control commands in an end-to-end manner. Introducing an integral compensator into the actor-critic framework greatly improves speed tracking accuracy and robustness. In addition, a two-phase learning scheme comprising offline and online learning is developed for practical use. A model with strong generalization ability is learned in the offline phase, and its flight policy is then continuously optimized in the online learning phase. Finally, the performance of the proposed algorithm is compared with that of the traditional PID algorithm.
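The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The clipped surrogate term follows the standard PPO formulation with the commonly used clipping range of 0.2. How the integral compensator enters the actor-critic framework is not specified in the abstract, so the sketch assumes the accumulated velocity-tracking error is simply appended to the state seen by the networks. The names `ppo_clip_term`, `IntegralCompensator`, `augmented_state`, and the anti-windup bound `limit` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


@dataclass
class IntegralCompensator:
    """Accumulates the velocity-tracking error over time (assumed design)."""
    dt: float                      # control period in seconds
    limit: float = 1.0             # hypothetical anti-windup bound
    integral: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])

    def update(self, v_ref: Sequence[float], v: Sequence[float]) -> List[float]:
        # Rectangular integration of the error, clamped to [-limit, limit].
        for i in range(3):
            acc = self.integral[i] + (v_ref[i] - v[i]) * self.dt
            self.integral[i] = max(-self.limit, min(acc, self.limit))
        return list(self.integral)


def augmented_state(state: Sequence[float], v_ref: Sequence[float],
                    v: Sequence[float], comp: IntegralCompensator) -> List[float]:
    """Append the velocity error and its integral to the raw state (assumption)."""
    error = [r - x for r, x in zip(v_ref, v)]
    return list(state) + error + comp.update(v_ref, v)
```

In this sketch a persistent steady-state velocity error keeps growing the integral entries, so a policy that receives the augmented state has a signal it can use to remove the bias, much as the integral term does in a PID controller.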

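Quadrotor control based on proximal policy optimization with integral compensation

Huan Hu, Qing-ling Wang
School of Automation, Southeast University, Nanjing 210096, China

Summary: The proximal policy optimization reinforcement learning algorithm is used to optimize a stochastic control policy, achieving stable speed control of a model-free quadrotor. The quadrotor model is controlled by four trainable sub-networks, which map the model states to control commands in an end-to-end manner and send them to the vehicle for execution. Introducing an integral compensator into the actor-critic framework greatly improves the accuracy and robustness of speed tracking. In addition, a two-phase learning scheme comprising offline and online learning is developed to meet the needs of real flight. In the online learning phase, the flight policy of the model is continuously optimized. Finally, the experimental results of the proposed algorithm are compared with those of the traditional PID algorithm.

Key words: Reinforcement learning; Proximal policy optimization; Quadrotor control; Neural network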


