Journal of Zhejiang University

ENGINEERING Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

Discovering semantically related technical terms and web resources in Q&A discussions

Author(s): Junfang Jia, Valeriia Tumanian, Guoqiang Li
Affiliation(s): School of Computer and Network Engineering, Shanxi Datong University, Datong 037009, China; School of Software, Shanghai Jiao Tong University, Shanghai 200240, China
Corresponding email(s): li.g@sjtu.edu.cn
Key Words: Technical terms, Web resources, Word embedding, Q&A web site, Clustering tasks, Recommendation tasks


Junfang Jia, Valeriia Tumanian, Guoqiang Li. Discovering semantically related technical terms and web resources in Q&A discussions[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2000186

@article{title="Discovering semantically related technical terms and web resources in Q&A discussions",
author="Junfang Jia, Valeriia Tumanian, Guoqiang Li",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.2000186"
}

%0 Journal Article
%T Discovering semantically related technical terms and web resources in Q&A discussions
%A Junfang Jia
%A Valeriia Tumanian
%A Guoqiang Li
%J Frontiers of Information Technology & Electronic Engineering
%P 969-985
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R https://doi.org/10.1631/FITEE.2000186

TY - JOUR
T1 - Discovering semantically related technical terms and web resources in Q&A discussions
A1 - Junfang Jia
A1 - Valeriia Tumanian
A1 - Guoqiang Li
JO - Frontiers of Information Technology & Electronic Engineering
SP - 969
EP - 985
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - https://doi.org/10.1631/FITEE.2000186
ER -


Abstract: A large and growing number of techniques and web resources are available for software engineering practice. Discovering semantically similar or related technical terms and web resources makes it possible to design services that facilitate information retrieval and information discovery. In this study, we extract technical terms and web resources from community question and answer (Q&A) discussions and propose an approach based on a neural language model to learn the semantic representations of technical terms and web resources in a joint low-dimensional vector space. Our approach maps each technical term or web resource to the semantic vector space based only on the technical terms and web resources that surround it in a discussion thread, without mining the text content of the discussion. We apply our approach to the Stack Overflow data dump of March 2018. Through both quantitative and qualitative analyses of clustering, search, and semantic reasoning tasks, we show that the learnt vector representations capture the semantic relatedness of technical terms and web resources, and that they can be exploited to support various search and semantic reasoning tasks by means of simple K-nearest neighbor search and simple algebraic operations in the embedding space.
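The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the window size, and the two-dimensional toy vectors are all hypothetical, and the neural language model training step is omitted. The sketch only shows how context pairs might be formed from the ordered terms and URLs of a thread (without using the discussion text), and how K-nearest neighbor search and simple vector arithmetic could then be run on already-learnt vectors.

```python
import math

def context_pairs(thread, window=2):
    """Return (target, context) pairs from one thread's ordered list of
    technical terms and URLs; the discussion text itself is not used."""
    pairs = []
    for i, target in enumerate(thread):
        lo, hi = max(0, i - window), min(len(thread), i + window + 1)
        pairs.extend((target, thread[j]) for j in range(lo, hi) if j != i)
    return pairs

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query_vec, embeddings, k=2, exclude=()):
    """K-nearest neighbor search by cosine similarity."""
    scored = [(cosine(query_vec, vec), name)
              for name, vec in embeddings.items() if name not in exclude]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

def analogy(a, b, c, embeddings, k=1):
    """Return the items closest to vec(b) - vec(a) + vec(c)."""
    target = [y - x + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    return nearest(target, embeddings, k=k, exclude={a, b, c})

# Hypothetical hand-made vectors, NOT learnt from Stack Overflow data:
# axis 0 separates the two language ecosystems, axis 1 marks test frameworks.
toy = {
    "java": [1.0, 0.0],
    "junit": [1.0, 1.0],
    "python": [-1.0, 0.0],
    "pytest": [-1.0, 1.0],
    "https://docs.python.org/3/": [-0.9, 0.1],
}

neighbors = nearest(toy["python"], toy, k=1, exclude={"python"})
reasoned = analogy("java", "junit", "python", toy)
```

With these toy vectors, the nearest neighbor of "python" is the documentation URL, and the query "java is to junit as python is to ?" resolves to "pytest". Because terms and web resources share one vector space, a single search function can serve both kinds of item.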

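Discovering semantically related technical terms and web resources in Q&A discussions

Junfang JIA¹, Valeriia TUMANIAN², Guoqiang LI²
¹School of Computer and Network Engineering, Shanxi Datong University, Datong 037009, China
²School of Software, Shanghai Jiao Tong University, Shanghai 200240, China
Summary: A large number of techniques and web resources for software engineering practice are available online, and this number continues to grow. Discovering semantically similar or related technical terms and web resources makes it possible to design appealing services that facilitate information retrieval and information discovery. This paper extracts technical terms and web resources from community question and answer (Q&A) discussions and proposes a method, based on a neural network language model, for representing the semantics of technical terms and web resources in a joint low-dimensional vector space. The method maps technical terms and web resources to the semantic vector space based only on the technical terms and web resources surrounding a technical term (or web resource) in a discussion thread, without mining the text content of the discussion. The method is applied to the Stack Overflow data dump of March 2018. Quantitative and qualitative analyses of clustering, search, and semantic reasoning tasks show that the learnt vector representations capture the semantic relatedness of technical terms and web resources, and that simple K-nearest neighbor search and simple algebraic operations on the learnt vectors in the embedding space can support various search and semantic reasoning tasks.

Key words: Technical terms; Web resources; Word embedding; Q&A web site; Clustering tasks; Recommendation tasks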


