
Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning

Abstract: As one of the most fundamental topics in reinforcement learning (RL), sample efficiency is essential to the deployment of deep RL algorithms. Unlike most existing exploration methods, which sample an action from different types of posterior distributions, we focus on the policy sampling process and propose an efficient selective sampling approach that improves sample efficiency by modeling the internal hierarchy of the environment. Specifically, we first employ clustering methods in the policy sampling process to generate an action candidate set. We then introduce a clustering buffer that models the internal hierarchy; it consists of on-policy data, off-policy data, and expert data, and is used to evaluate actions from the clusters in the action candidate set during the exploration stage. In this way, our approach is able to take advantage of the supervision information in the expert demonstration data. Experiments on six different continuous locomotion environments demonstrate that selective sampling achieves superior reinforcement learning performance and faster convergence. In particular, on the LGSVL task, our method reduces the number of convergence steps by 46.7% and the convergence time by 28.5%. Furthermore, our code is open-sourced for reproducibility and is available at https://github.com/Shihwin/SelectiveSampling.
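To make the sampling scheme concrete, the following minimal Python sketch illustrates one selective-sampling step as described in the abstract: candidate actions are drawn from the policy, clustered into an action candidate set, and the cluster representatives are scored by a critic that, in the full method, would be trained on the clustering buffer of on-policy, off-policy, and expert data. The Policy and Critic stubs and the selective_sample helper are hypothetical stand-ins for illustration only (the paper's actual clustering method, interfaces, and buffer design may differ); k-means is used here merely as a representative clustering method.

import numpy as np
from sklearn.cluster import KMeans

class Policy:
    """Stub stochastic policy: Gaussian noise around a state-dependent mean."""
    def __init__(self, action_dim):
        self.action_dim = action_dim

    def sample(self, state):
        # Draw one action from the policy's distribution for this state.
        return np.tanh(state[:self.action_dim] + 0.3 * np.random.randn(self.action_dim))

class Critic:
    """Stub Q-function; in the full method it would be trained on the
    clustering buffer mixing on-policy, off-policy, and expert data."""
    def q_value(self, state, action):
        return float(-np.linalg.norm(action - np.tanh(state[:action.shape[0]])))

def selective_sample(policy, critic, state, n_candidates=64, n_clusters=8):
    # 1) Draw candidate actions from the current policy.
    candidates = np.stack([policy.sample(state) for _ in range(n_candidates)])
    # 2) Cluster the candidates; cluster centers form the action candidate set.
    centers = KMeans(n_clusters=n_clusters, n_init=10).fit(candidates).cluster_centers_
    # 3) Score each cluster representative with the critic and act greedily.
    scores = [critic.q_value(state, a) for a in centers]
    return centers[int(np.argmax(scores))]

if __name__ == "__main__":
    state = np.random.randn(8)
    print("selected action:", selective_sample(Policy(action_dim=2), Critic(), state))

Acting on the best cluster representative, rather than on a single raw policy sample, is the point at which supervision carried by the expert transitions in the buffer can bias exploration toward promising regions of the action space.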

Key words: Reinforcement learning; Sample efficiency; Sampling process; Clustering methods; Autonomous driving

Chinese Summary (translated): Efficient deep reinforcement learning based on a clustering experience buffer with expert demonstrations

Shimin WANG1, Binqi ZHAO1, Zhengfeng ZHANG1, Junping ZHANG1, Jian PU2
1 Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China
2 Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China

Abstract: As one of the most fundamental topics in reinforcement learning, sample efficiency is essential to the deployment of deep reinforcement learning algorithms. Unlike most existing exploration methods, which sample actions from different types of posterior distributions, we focus on the policy sampling process and propose an effective selective sampling method that improves sample efficiency by modeling the internal hierarchy of the environment. Specifically, we first employ clustering methods in the policy sampling process to generate an action candidate set; we then introduce a clustering buffer that models the internal hierarchy, consisting of on-policy, off-policy, and expert data, to evaluate the value of actions from different clusters in the candidate set during the exploration stage. In this way, our method makes fuller use of the supervision information in the expert demonstration data. Experiments in six different continuous locomotion environments show that selective sampling delivers superior reinforcement learning performance and faster convergence. In particular, on the LGSVL task, the method reduces the number of convergence steps by 46.7% and the convergence time by 28.5%. The code is open-source and available at https://github.com/Shihwin/SelectiveSampling.

Key words: Reinforcement learning; Sample efficiency; Sampling process; Clustering methods; Autonomous driving


DOI: 10.1631/FITEE.2300084
CLC number: TP181


On-line Access: 2023-12-04
Received: 2023-02-12
Revision Accepted: 2023-12-05
Crosschecked: 2023-05-19
