Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2023 Vol.24 No.11 P.1541-1556
Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning
Abstract: As one of the most fundamental topics in reinforcement learning (RL), sample efficiency is essential to the deployment of deep RL algorithms. Unlike most existing exploration methods, which sample an action from different types of posterior distributions, we focus on the policy sampling process and propose an efficient selective sampling approach that improves sample efficiency by modeling the internal hierarchy of the environment. Specifically, we first employ clustering methods in the policy sampling process to generate an action candidate set. We then introduce a clustering buffer, consisting of on-policy data, off-policy data, and expert data, to model the internal hierarchy and to evaluate the actions from the clusters in the candidate set during exploration. In this way, our approach can exploit the supervision information in the expert demonstration data. Experiments on six continuous locomotion environments show that selective sampling achieves superior reinforcement learning performance and faster convergence. In particular, on the LGSVL task, our method reduces the number of convergence steps by 46.7% and the convergence time by 28.5%. For reproducibility, our code is open-source and available at https://github.com/Shihwin/SelectiveSampling.
Key words: Reinforcement learning; Sample efficiency; Sampling process; Clustering methods; Autonomous driving
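The two-step procedure outlined in the abstract (cluster candidate actions drawn from the policy, then choose among cluster representatives using value estimates) can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the policy and critic objects and all hyperparameters are hypothetical stand-ins, and the paper's clustering-buffer-based evaluation is reduced here to a plain critic score.

import numpy as np
from sklearn.cluster import KMeans

def selective_sample(policy, critic, state, n_candidates=32, n_clusters=4):
    # Draw a pool of candidate actions from the current stochastic policy.
    # `policy.sample` is a hypothetical method returning one action vector.
    candidates = np.stack([policy.sample(state) for _ in range(n_candidates)])

    # Group similar actions; each cluster center serves as one candidate mode.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(candidates)

    # Score one representative per cluster with the critic and act with the
    # best one. `critic.q_value` is likewise a hypothetical stand-in.
    scores = [critic.q_value(state, a) for a in km.cluster_centers_]
    return km.cluster_centers_[int(np.argmax(scores))]

In this sketch, evaluating only one representative per cluster keeps the number of critic evaluations at n_clusters rather than n_candidates, which is where a sample-efficiency gain from selective sampling would plausibly come from.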
1Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China
2Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
DOI: 10.1631/FITEE.2300084
CLC number: TP181
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-05-19