
CLC number: TP181
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-05-19
Shihmin WANG, Binqi ZHAO, Zhengfeng ZHANG, Junping ZHANG, Jian PU. Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2300084
Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning
1 Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China
2 Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
Abstract: As one of the most fundamental topics in reinforcement learning, sample efficiency is critical for the deployment of deep reinforcement learning algorithms. Unlike most existing exploration methods, which sample actions from different types of posterior distributions, we focus on the policy's sampling process and propose an efficient selective sampling method that improves sample efficiency by modeling the internal hierarchical structure of the environment. Specifically, we first use a clustering method in the policy sampling process to generate an action candidate set, and then introduce a clustering buffer that models the internal hierarchy; the buffer consists of on-policy data, off-policy data, and expert data, and is used to evaluate the value of the different action categories in the candidate set during exploration. In this way, our method makes fuller use of the supervisory information in expert demonstration data. Experiments on six different continuous locomotion environments show that the selective sampling method achieves superior reinforcement learning performance and faster convergence. In particular, on the LGSVL task it reduces the number of steps to convergence by 46.7% and the convergence time by 28.5%. The code is open-source at https://github.com/Shihwin/SelectiveSampling.
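To make the abstract's two ideas concrete, the following is a minimal Python sketch of (1) clustering policy-sampled action candidates and scoring one representative per cluster with a critic, and (2) a buffer that mixes on-policy, off-policy, and expert transitions. The names (policy.sample, q_fn, ClusteringBuffer) and all hyperparameters (candidate-set size, cluster count, mixing ratios) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

    import numpy as np
    from sklearn.cluster import KMeans

    class ClusteringBuffer:
        """Buffer with three partitions of transitions; minibatches mix
        on-policy, off-policy, and expert demonstration data.
        The mixing ratios here are assumptions, not the paper's values."""
        def __init__(self, expert_transitions, mix=(0.5, 0.3, 0.2)):
            self.on_policy, self.off_policy = [], []
            self.expert = list(expert_transitions)
            self.mix = mix

        def sample(self, batch_size, rng=None):
            rng = rng or np.random.default_rng()
            batch = []
            pools = (self.on_policy, self.off_policy, self.expert)
            for pool, frac in zip(pools, self.mix):
                k = min(int(batch_size * frac), len(pool))
                if k > 0:
                    idx = rng.choice(len(pool), size=k, replace=False)
                    batch.extend(pool[i] for i in idx)
            return batch

    def select_action(policy, q_fn, state, n_candidates=32, n_clusters=4):
        """Selective sampling: draw a candidate set from the policy,
        cluster it, score one representative (the centroid) per cluster
        with the critic, and act with the highest-valued representative."""
        candidates = np.stack([policy.sample(state)
                               for _ in range(n_candidates)])
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(candidates)
        values = [q_fn(state, c) for c in km.cluster_centers_]
        return km.cluster_centers_[int(np.argmax(values))]

In this sketch, the critic q_fn is assumed to be trained on minibatches drawn from the ClusteringBuffer, so the per-cluster scores reflect the supervisory signal of the expert demonstrations; the paper's actual hierarchy modeling may differ in detail.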

