Journal of Zhejiang University

Frontiers of Information Technology & Electronic Engineering 2018 Vol.19 No.4 P.513-523

Supervised topic models with weighted words: multi-label document classification

Author(s): Yue-peng Zou, Ji-hong Ouyang, Xi-ming Li
Affiliation(s): 1. College of Computer Science and Technology, Jilin University, Changchun 130012, China
Corresponding email(s): ouyj@jlu.edu.cn, liximing86@gmail.com
Key Words: Supervised topic model, Multi-label classification, Class frequency, Labeled latent Dirichlet allocation (L-LDA), Dependency-LDA


Yue-peng Zou, Ji-hong Ouyang, Xi-ming Li. Supervised topic models with weighted words: multi-label document classification[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(4): 513-523.

@article{Zou2018,
title="Supervised topic models with weighted words: multi-label document classification",
author="Yue-peng Zou and Ji-hong Ouyang and Xi-ming Li",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="19",
number="4",
pages="513-523",
year="2018",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601668"
}

%0 Journal Article
%T Supervised topic models with weighted words: multi-label document classification
%A Yue-peng Zou
%A Ji-hong Ouyang
%A Xi-ming Li
%J Frontiers of Information Technology & Electronic Engineering
%V 19
%N 4
%P 513-523
%@ 2095-9184
%D 2018
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1601668

TY  - JOUR
T1  - Supervised topic models with weighted words: multi-label document classification
A1  - Yue-peng Zou
A1  - Ji-hong Ouyang
A1  - Xi-ming Li
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 19
IS  - 4
SP  - 513
EP  - 523
SN  - 2095-9184
Y1  - 2018
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.1601668
ER  -


Abstract: Supervised topic modeling algorithms have been successfully applied to multi-label document classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. However, these models neglect the class frequency of words (i.e., the number of classes in which a word occurs in the training data), which is informative for classification. To address this, we propose a method, the class frequency weight (CF-weight), that weights words according to their class frequency. CF-weight is based on the intuition that a word with a higher class frequency is less discriminative, and a word with a lower class frequency is more discriminative. In this study, CF-weight is used to improve L-LDA and dependency-LDA. Experiments on real-world multi-label datasets demonstrate that CF-weight based algorithms are competitive with the existing supervised topic models.
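
The class frequency described in the abstract can be illustrated with a minimal sketch. The counting step follows the definition given above (the number of classes in which a word occurs in the training data). The weighting function is an assumption: the abstract does not state the CF-weight formula, so an IDF-style logarithm is used here only to show a weight that decreases as class frequency grows. The function names and the toy corpus are hypothetical.

```python
import math
from collections import defaultdict


def class_frequency(docs, labels):
    """For each word, count the distinct classes it occurs in."""
    classes_of_word = defaultdict(set)
    for words, doc_labels in zip(docs, labels):
        # A word in a multi-label document is counted for every label of that document.
        for w in set(words):
            classes_of_word[w].update(doc_labels)
    return {w: len(cs) for w, cs in classes_of_word.items()}


def cf_weight(cf, num_classes):
    """Assumed IDF-style form (not the paper's formula): decreases with class frequency."""
    return {w: math.log(1.0 + num_classes / c) for w, c in cf.items()}


# Toy multi-label corpus (hypothetical data).
docs = [
    ["topic", "model", "the"],
    ["svm", "kernel", "the"],
    ["topic", "kernel", "the"],
]
labels = [["ml", "nlp"], ["ml"], ["nlp", "vision"]]

num_classes = len({label for doc_labels in labels for label in doc_labels})
cf = class_frequency(docs, labels)
weights = cf_weight(cf, num_classes)

for word in sorted(weights, key=weights.get, reverse=True):
    print(f"{word}: cf={cf[word]}, weight={weights[word]:.3f}")
```

In this toy corpus, "svm" appears under a single class and receives the largest weight, while "the" appears under every class and receives the smallest, which matches the stated intuition. How such weights enter L-LDA or dependency-LDA inference is not described in the abstract and is not shown here.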

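Supervised topic models with weighted words: multi-label text classification

Summary: Supervised topic models have been successfully applied to multi-label text classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. These existing models overlook the effect on classification of a word's class frequency, i.e., the number of classes in which the word occurs in the training set. To address this, we introduce class frequency information and propose a class frequency word-weighting method (class frequency weight, CF-weight). CF-weight rests on the assumption that words with a higher (or lower) class frequency have lower (or higher) discriminative power in classification. We apply CF-weight to the L-LDA and dependency-LDA models. Experimental results show that, compared with traditional supervised topic models, CF-weight based models have an advantage in multi-label classification performance.

Keywords: Supervised topic model; Multi-label classification; Class frequency; Labeled latent Dirichlet allocation; Dependency-LDA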


