Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Topic discovery and evolution in scientific literature based on content and citations

Abstract: Researchers are increasingly interested in how important research topics evolve over time within the scientific literature. In a dataset of scientific articles, each document can be viewed as comprising both its own words and its citations of other documents. In this paper, we propose citation-content-LDA, a topic discovery method based on latent Dirichlet allocation (LDA) that accounts for both the citation relations and the content of each document via a probabilistic generative model. Citation-content-LDA is a two-level topic model that uses citation information for ‘father’ topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We tested the proposed model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS); the results show that our algorithm effectively discovers important topics and reflects the evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.

Key words: Topic extraction; Topic evolution; Evaluation method
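The two-level structure described in the abstract can be illustrated with a small generative sketch. The abstract does not spell out the generative process, so everything below is an assumption made for illustration: the function names, the symmetric Dirichlet priors, the toy sizes, and the choice to draw a father topic independently for each citation and each word. It is one plausible reading of "citation information for father topics and text information for sub-topics", not the authors' model, and it does not implement the collapsed Gibbs sampler used for inference.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def build_toy_model(n_fathers=2, n_subs=4, n_cited=5, vocab=8, beta=1.0, seed=0):
    """Hypothetical toy parameters (all sizes and priors are assumptions)."""
    rng = random.Random(seed)
    # Each father topic has a distribution over cited documents ...
    father_cite = [sample_dirichlet(beta, n_cited, rng) for _ in range(n_fathers)]
    # ... and a distribution over sub-topics.
    father_sub = [sample_dirichlet(beta, n_subs, rng) for _ in range(n_fathers)]
    # Each sub-topic has a distribution over vocabulary words.
    sub_word = [sample_dirichlet(beta, vocab, rng) for _ in range(n_subs)]
    return rng, father_cite, father_sub, sub_word


def generate_document(n_citations, n_words, father_cite, father_sub, sub_word,
                      alpha, rng):
    """Generate citation ids and word ids for one document."""
    # Per-document mixture over father topics.
    theta = sample_dirichlet(alpha, len(father_cite), rng)
    citations, words = [], []
    for _ in range(n_citations):
        father = sample_index(theta, rng)
        citations.append(sample_index(father_cite[father], rng))
    for _ in range(n_words):
        father = sample_index(theta, rng)
        sub = sample_index(father_sub[father], rng)
        words.append(sample_index(sub_word[sub], rng))
    return citations, words


rng, father_cite, father_sub, sub_word = build_toy_model()
citations, words = generate_document(3, 10, father_cite, father_sub, sub_word,
                                     alpha=1.0, rng=rng)
print("cited document ids:", citations)
print("word ids:", words)
```

In this sketch the citations depend only on the father topics, while the words pass through a sub-topic chosen under a father topic, which is what makes the model two-level. Inference would run in the opposite direction, recovering the topic assignments from observed citations and words.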

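Chinese Summary (translated): Topic discovery and evolution in scientific literature based on content and citations

Summary: The way in which important topics in scientific literature databases evolve over time has attracted growing attention from researchers worldwide. In a dataset of scientific papers, any paper can be regarded as consisting of the words of the paper itself and the documents that it cites. In this paper, we propose a topic discovery method named "Citation-content-LDA (latent Dirichlet allocation)", which generates both the citation relations of a document and the words of the document itself within a single probabilistic generative model. The Citation-content-LDA model uses a two-level topic model in which citation information generates father topics and text information generates sub-topics. The model parameters are estimated by a Gibbs sampling algorithm. We also propose a topic evolution algorithm that comprises two steps: topic segmentation and calculation of the dependency relations between topics. We tested the proposed Citation-content-LDA model and topic evolution algorithm on two datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), and showed that the proposed algorithm effectively discovers important topics and reflects the topic evolution of important research themes. Under our evaluation metrics, Citation-content-LDA outperforms Content-LDA and Citation-LDA.

Key words: Topic extraction; Topic evolution; Evaluation method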


DOI: 10.1631/FITEE.1601125

CLC number: TP391


On-line Access: 2024-08-27

Received: 2023-10-17

Revision Accepted: 2024-05-08

Crosschecked: 2017-09-22

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE