
Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Topic modeling for large-scale text data

Abstract: This paper develops a novel online algorithm, moving average stochastic variational inference (MASVI), which reuses results obtained in previous iterations to smooth out noisy natural gradients. We analyze the convergence of the proposed algorithm and conduct experiments on two large-scale collections containing millions of documents. Experimental results show that, compared with the stochastic variational inference (SVI) and SGRLD algorithms, our algorithm achieves a faster convergence rate and better performance.

Key words: Latent Dirichlet allocation (LDA), Topic modeling, Online learning, Moving average

Chinese Summary: Topic modeling for large-scale text data

Objective: To study online inference algorithms for topic models on large-scale data. Addressing the large error of the stochastic gradient in the stochastic variational inference algorithm, we propose a moving average stochastic variational inference algorithm.
Innovation: The moving average of the stochastic gradients over multiple iterations is used to approximate the true stochastic gradient, thereby reducing the error between the stochastic gradient and the true gradient.
Method: The study is carried out on latent Dirichlet allocation, the basic topic model. Considering that the document subsets sampled in different iterations have different vocabularies (Table 1), the moving average of the stochastic terms from different iterations is used to approximate the stochastic term of the true gradient. To preserve accuracy as much as possible, only the stochastic terms of the most recent R iterations are used (Fig. 2), and the convergence of the proposed algorithm is verified.
Conclusion: Building on the stochastic variational inference algorithm, a moving average stochastic variational inference algorithm is proposed, achieving better text topic modeling and a faster convergence rate.

Key words: Latent Dirichlet allocation (LDA); Topic modeling; Online learning; Moving average
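The update described above (a moving average over the stochastic terms of the last R iterations, substituted into the standard SVI natural-gradient step) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `masvi_update`, the flat-list representation of the variational parameter, and the scalar prior term `eta` are all assumptions for exposition.

```python
from collections import deque

def masvi_update(lam, stochastic_terms, new_term, R, rho, eta):
    """One MASVI-style update of a variational parameter (illustrative).

    lam              -- current variational parameter (list of floats)
    stochastic_terms -- deque holding the stochastic terms of recent iterations
    new_term         -- stochastic term computed from the current minibatch
    R                -- window size of the moving average
    rho              -- step size for this iteration
    eta              -- prior contribution in the natural-gradient update
    """
    # Keep only the stochastic terms of the most recent R iterations.
    stochastic_terms.append(new_term)
    if len(stochastic_terms) > R:
        stochastic_terms.popleft()

    # Moving average of the stored stochastic terms: this smoothed value
    # stands in for the noisy stochastic part of the natural gradient.
    n = len(stochastic_terms)
    avg = [sum(t[k] for t in stochastic_terms) / n for k in range(len(lam))]

    # Standard SVI-style interpolation, but with the smoothed estimate.
    return [(1 - rho) * lam[k] + rho * (eta + avg[k]) for k in range(len(lam))]
```

With R = 1 the buffer holds only the current minibatch's term and the update reduces to plain SVI; larger R trades a small bias for substantially lower gradient variance, which is the mechanism behind the faster convergence reported above.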



DOI: 10.1631/FITEE.1400352

CLC number: TP391.1


On-line Access: 2015-06-04

Received: 2014-10-15

Revision Accepted: 2015-03-12

Crosschecked: 2015-05-07

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE