Journal of Zhejiang University

Journal of Zhejiang University SCIENCE C 2014 Vol.15 No.3 P.223-231

Greedy feature replacement for online value function approximation

Author(s): Feng-fei Zhao, Zheng Qin, Zhuo Shao, Jun Fang, Bo-yan Ren
Affiliation(s): 1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China more
Corresponding email(s): zhaofengfei@gmail.com, qingzh@tsinghua.edu.cn, shaoz09@mails.tsinghua.edu.cn
Key Words: Reinforcement learning, Function approximation, Feature dependency, Online expansion, Feature replacement

Share this article to： More <<< Previous Article \|Next Article >>>

Feng-fei Zhao, Zheng Qin, Zhuo Shao, Jun Fang, Bo-yan Ren. Greedy feature replacement for online value function approximation[J]. Journal of Zhejiang University Science C, 2014, 15(3): 223-231.

@article{title="Greedy feature replacement for online value function approximation",
author="Feng-fei Zhao, Zheng Qin, Zhuo Shao, Jun Fang, Bo-yan Ren",
journal="Journal of Zhejiang University Science C",
volume="15",
number="3",
pages="223-231",
year="2014",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.C1300246"
}

%0 Journal Article
%T Greedy feature replacement for online value function approximation
%A Feng-fei Zhao
%A Zheng Qin
%A Zhuo Shao
%A Jun Fang
%A Bo-yan Ren
%J Journal of Zhejiang University SCIENCE C
%V 15
%N 3
%P 223-231
%@ 1869-1951
%D 2014
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C1300246

TY - JOUR
T1 - Greedy feature replacement for online value function approximation
A1 - Feng-fei Zhao
A1 - Zheng Qin
A1 - Zhuo Shao
A1 - Jun Fang
A1 - Bo-yan Ren
J0 - Journal of Zhejiang University Science C
VL - 15
IS - 3
SP - 223
EP - 231
%@ 1869-1951
Y1 - 2014
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/jzus.C1300246

Abstract
Chinese Summary
Academic Network
Reviewer Comment

Abstract: reinforcement learning (RL) in real-world problems requires function approximations that depend on selecting the appropriate feature representations. Representational expansion techniques can make linear approximators represent value functions more effectively; however, most of these techniques function well only for low dimensional problems. In this paper, we present the greedy feature replacement (GFR), a novel online expansion technique, for value-based RL algorithms that use binary features. Given a simple initial representation, the feature representation is expanded incrementally. New feature dependencies are added automatically to the current representation and conjunctive features are used to replace current features greedily. The virtual temporal difference (TD) error is recorded for each conjunctive feature to judge whether the replacement can improve the approximation. Correctness guarantees and computational complexity analysis are provided for GFR. Experimental results in two domains show that GFR achieves much faster learning and has the capability to handle large-scale problems.

用于在线值函数近似的贪婪特征替换方法

研究目的：强化学习需要使用值函数近似来处理真实世界中的高维度状态空间问题，而值函数近似的效果很大程度上取决于合适特征表示的选取。表示扩展技术可使线性近似器更有效地表示值函数，然而现有的表示扩展技术或者只适用于低维度问题，或者在学习过程中不能及时发现重要的特征依赖。因此，本文提出了一种新型在线特征扩展技术——贪婪特征替换（GFR）。
创新要点：通过发现替换区域来执行特征替换的方式，贪婪特征替换方法增量地扩展特征表示，可以处理高维度状态空间问题，解决了传统在线扩展技术的局限。通过识别替换区域，贪婪特征替换方法优化了特征组合的加入顺序，提高了特征表示扩展的性能。
研究方法：引入虚拟时序差的概念，模拟新特征组合加入特征表示后的时序差，通过分析实际时序差与虚拟时序差累积情况的差距，选择新的特征组合。如差距超过给定阈值，这个特征组合将被作为新特征。在每次特征替换中，两个原有的特征被一个新的特征组合取代。
重要结论：理论分析表明，使用贪婪特征替换方法的值函数近似可以渐进地得到最优近似结果。对两个有代表性的学习问题的实验结果显示，贪婪特征替换可以达到更快的学习效果，而且具备处理大规模状态空间的能力。贪婪特征替换方法可以应用于各种在线值函数近似。

关键词：强化学习；值函数近似；特征依赖；在线扩展；特征替换

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference

[1]Albus, J.S., 1971. A theory of cerebellar function. Math. Biosci., 10(1-2):25-61.

[2]Barto, A.G., Bradtke, S.J., Singh, S.P., 1995. Learning to act using real-time dynamic programming. Artif. Intell., 72(1-2):81-138.

[3]Buro, M., 1999. From simple features to sophisticated evaluation functions. Proc. 1st Int. Conf. on Computers and Games, p.126-145.

[4]de Hauwere, Y.M., Vrancx, P., Nowé, A., 2010. Generalized learning automata for multi-agent reinforcement learning. AI Commun., 23(4):311-324.

[5]Geramifard, A., Doshi, F., Redding, J., et al., 2011. Online discovery of feature dependencies. Proc. 28th Int. Conf. on Machine Learning, p.881-888.

[6]Geramifard, A., Dann, C., How, J.P., 2013. Off-policy learning combined with automatic feature expansion for solving large MDPs. Proc. 1st Multidisciplinary Conf. on Reinforcement Learning and Decision Making, p.29-33.

[7]Kaelbling, L.P., Littman, M.L., Moore, A.W., 1996. Reinforcement learning: a survey. J. Artif. Intell. Res., 4:237-285.

[8]Kolter, J.Z., Ng, A.Y., 2009. Near-Bayesian exploration in polynomial time. Proc. 26th Annual Int. Conf. on Machine Learning, p.513-520.

[9]Lagoudakis, M.G., Parr, R., 2003. Least-squares policy iteration. J. Mach. Learn. Res., 4(6):1107-1149.

[10]Pazis, J., Lagoudakis, M.G., 2009. Binary action search for learning continuous-action control policies. Proc. 26th Annual Int. Conf. on Machine Learning, p.793-800.

[11]Puterman, M.L., 1994. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, NY, p.139-161.

[12]Ratitch, B., Precup, D., 2004. Sparse distributed memories for on-line value-based reinforcement learning. Proc. 15th European Conf. on Machine Learning, p.347-358.

[13]Rummery, G.A., Niranjan, M., 1994. On-line Q-learning Using Connectionist Systems. Technical Report No. cued/ f-infeng/tr166, Engineering Department, Cambridge University.

[14]Singh, S., Jaakkola, T., Littman, M.L., et al., 2000. Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn., 38(3):287-308.

[15]Singh, S.P., Sutton, R.S., 1996. Reinforcement learning with replacing eligibility traces. Mach. Learn., 22(1-3):123-158.

[16]Singh, S.P., Yee, R.C., 1994. An upper bound on the loss from approximate optimal-value functions. Mach. Learn., 16(3):227-233.

[17]Sprague, N., Ballard, D., 2003. Multiple-goal reinforcement learning with modular sarsa(0). Proc. 18th Int. Joint Conf. on Artificial Intelligence, p.1445-1447.

[18]Sturtevant, N.R., White, A.M., 2006. Feature construction for reinforcement learning in hearts. Proc. 5th Int. Conf. on Computers and Games, p.122-134.

[19]Sutton, R.S., 1996. Generalization in reinforcement learning: successful examples using sparse coarse coding. Adv. Neur. Inform. Process. Syst., 8:1038-1044.

[20]Sutton, R.S., Barto, A.G., 1998. Reinforcement Learning: an Introduction. MIT Press, Cambridge, MA, USA, p.3-25.

[21]Tsitsiklis, J.N., 1994. Asynchronous stochastic approximation and Q-learning. Mach. Learn., 16(3):185-202.

[22]Tsitsiklis, J.N., van Roy, B., 1997. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr., 42(5):674-690.

[23]Watkins, C.J.C.H., Dayan, P., 1992. Q-learning. Mach. Learn., 8(3-4):279-292.

[24]Whiteson, S., Taylor, M.E., Stone, P., 2007. Adaptive Tile Coding for Value Function Approximation. Technical Report No. AI-TR-07-339, University of Texas at Austin.

Open peer comments: Debate/Discuss/Question/Opinion

<1>