
CLC number: TP181
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-05-19
Shihmin WANG, Binqi ZHAO, Zhengfeng ZHANG, Junping ZHANG, Jian PU. Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2300084
Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning
1 Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China
2 Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
Abstract: As one of the most fundamental topics in reinforcement learning, sample efficiency is critical for the deployment of deep reinforcement learning algorithms. Unlike most existing exploration methods, which sample actions from different types of posterior distributions, we focus on the policy's sampling process and propose an efficient selective sampling method that improves sample efficiency by modeling the internal hierarchical structure of the environment. Specifically, we first use a clustering method in the policy sampling process to generate an action candidate set, and then introduce a clustering buffer that models the internal hierarchy; the buffer consists of on-policy data, off-policy data, and expert data, and is used to evaluate the value of the different action categories in the candidate set during exploration. In this way, our method makes fuller use of the supervisory information in expert demonstration data. Experiments on six different continuous locomotion environments show that the selective sampling method achieves superior reinforcement learning performance and faster convergence. In particular, on the LGSVL task it reduces the number of steps to convergence by 46.7% and the convergence time by 28.5%. The code is open-source at https://github.com/Shihwin/SelectiveSampling.
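To make the abstract's two ideas concrete, the following is a minimal Python sketch of (1) clustering policy-sampled action candidates and scoring one representative per cluster with a critic, and (2) a buffer that mixes on-policy, off-policy, and expert transitions. The names (policy.sample, q_fn, ClusteringBuffer) and all hyperparameters (candidate-set size, cluster count, mixing ratios) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

    import numpy as np
    from sklearn.cluster import KMeans

    class ClusteringBuffer:
        """Buffer with three partitions of transitions; minibatches mix
        on-policy, off-policy, and expert demonstration data.
        The mixing ratios here are assumptions, not the paper's values."""
        def __init__(self, expert_transitions, mix=(0.5, 0.3, 0.2)):
            self.on_policy, self.off_policy = [], []
            self.expert = list(expert_transitions)
            self.mix = mix

        def sample(self, batch_size, rng=None):
            rng = rng or np.random.default_rng()
            batch = []
            pools = (self.on_policy, self.off_policy, self.expert)
            for pool, frac in zip(pools, self.mix):
                k = min(int(batch_size * frac), len(pool))
                if k > 0:
                    idx = rng.choice(len(pool), size=k, replace=False)
                    batch.extend(pool[i] for i in idx)
            return batch

    def select_action(policy, q_fn, state, n_candidates=32, n_clusters=4):
        """Selective sampling: draw a candidate set from the policy,
        cluster it, score one representative (the centroid) per cluster
        with the critic, and act with the highest-valued representative."""
        candidates = np.stack([policy.sample(state)
                               for _ in range(n_candidates)])
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(candidates)
        values = [q_fn(state, c) for c in km.cluster_centers_]
        return km.cluster_centers_[int(np.argmax(values))]

In this sketch, the critic q_fn is assumed to be trained on minibatches drawn from the ClusteringBuffer, so the per-cluster scores reflect the supervisory signal of the expert demonstrations; the paper's actual hierarchy modeling may differ in detail.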

