CLC number: TP181
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-05-19
Shihmin WANG, Binqi ZHAO, Zhengfeng ZHANG, Junping ZHANG, Jian PU. Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning[J]. Frontiers of Information Technology & Electronic Engineering, 2023, 24(11): 1541-1556.
Abstract: As one of the most fundamental topics in reinforcement learning (RL), sample efficiency is essential to the deployment of deep RL algorithms. Unlike most existing exploration methods that sample an action from different types of posterior distributions, we focus on the policy sampling process and propose an efficient selective sampling approach to improve sample efficiency by modeling the internal hierarchy of the environment. Specifically, we first employ clustering methods in the policy sampling process to generate an action candidate set. Then we introduce a clustering buffer for modeling the internal hierarchy, which consists of on-policy data, off-policy data, and expert data, to evaluate actions from the clusters in the action candidate set during the exploration stage. In this way, our approach is able to take advantage of the supervision information in the expert demonstration data. Experiments on six different continuous locomotion environments show that selective sampling achieves superior reinforcement learning performance and faster convergence. In particular, on the LGSVL task, our method reduces the number of convergence steps by 46.7% and the convergence time by 28.5%. Furthermore, our code is open-sourced for reproducibility and is available at https://github.com/Shihwin/SelectiveSampling.
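For readers who want a concrete picture of the action-selection loop described above, the following is a minimal sketch based only on the abstract, not the authors' released implementation (see the GitHub repository linked above for that). It assumes a candidate set is drawn from the policy, clustered with a plain k-means routine, and scored against a buffer holding on-policy, off-policy, and expert triples; every class name, function name, and the nearest-neighbour scoring rule are illustrative assumptions.

```python
# Hedged sketch of selective sampling with a clustering buffer (assumptions, not the paper's code):
# cluster candidate actions drawn from the policy, score each cluster against a buffer mixing
# on-policy, off-policy, and expert transitions, then act from the highest-scoring cluster.
import numpy as np

def kmeans(points, k, iters=20, rng=None):
    """Plain k-means on an (n, d) array; returns (centroids, labels)."""
    rng = rng or np.random.default_rng(0)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

class ClusteringBuffer:
    """Stores (state, action, return) triples from on-policy, off-policy, and expert rollouts."""
    def __init__(self):
        self.states, self.actions, self.returns = [], [], []

    def add(self, state, action, ret):
        self.states.append(np.asarray(state, dtype=float))
        self.actions.append(np.asarray(action, dtype=float))
        self.returns.append(float(ret))

    def score_action(self, state, action, n_neighbors=10):
        """Return-weighted score: high when the action resembles high-return actions taken near this state."""
        if not self.states:
            return 0.0
        S, A, R = map(np.asarray, (self.states, self.actions, self.returns))
        idx = np.argsort(((S - state) ** 2).sum(-1))[:n_neighbors]
        dist = ((A[idx] - action) ** 2).sum(-1)
        return float((R[idx] * np.exp(-dist)).mean())

def select_action(policy_sample, state, buffer, n_candidates=64, n_clusters=4, rng=None):
    """Draw candidates from the policy, cluster them, and sample from the best-scoring cluster."""
    rng = rng or np.random.default_rng(0)
    candidates = np.stack([policy_sample(state) for _ in range(n_candidates)])
    centroids, labels = kmeans(candidates, n_clusters, rng=rng)
    scores = [buffer.score_action(state, c) for c in centroids]
    members = candidates[labels == int(np.argmax(scores))]
    return members[rng.integers(len(members))]

# Toy usage with a random Gaussian "policy"; the buffer entries stand in for expert/off-policy data.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    buf = ClusteringBuffer()
    for _ in range(200):
        s, a = rng.normal(size=3), rng.normal(size=2)
        buf.add(s, a, ret=-np.linalg.norm(a - s[:2]))
    policy = lambda s: s[:2] + 0.5 * rng.normal(size=2)
    print(select_action(policy, rng.normal(size=3), buf))
```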