CLC number: TP311
On-line Access: 2021-07-20
Received: 2020-04-21
Revision Accepted: 2020-12-23
Crosschecked: 2021-06-16
https://orcid.org/0000-0002-3451-8487
Junfang Jia, Valeriia Tumanian, Guoqiang Li. Discovering semantically related technical terms and web resources in Q&A discussions[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(7): 969-985.
@article{title="Discovering semantically related technical terms and web resources in Q&A discussions",
author="Junfang Jia and Valeriia Tumanian and Guoqiang Li",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="22",
number="7",
pages="969-985",
year="2021",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2000186"
}
%0 Journal Article
%T Discovering semantically related technical terms and web resources in Q&A discussions
%A Junfang Jia
%A Valeriia Tumanian
%A Guoqiang Li
%J Frontiers of Information Technology & Electronic Engineering
%V 22
%N 7
%P 969-985
%@ 2095-9184
%D 2021
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2000186
TY - JOUR
T1 - Discovering semantically related technical terms and web resources in Q&A discussions
A1 - Junfang Jia
A1 - Valeriia Tumanian
A1 - Guoqiang Li
JO - Frontiers of Information Technology & Electronic Engineering
VL - 22
IS - 7
SP - 969
EP - 985
SN - 2095-9184
Y1 - 2021
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2000186
ER -
Abstract: A vast and growing number of techniques and web resources are available for software engineering practice. Discovering semantically similar or related technical terms and web resources makes it possible to design services that facilitate information retrieval and information discovery. In this study, we extract technical terms and web resources from community question and answer (Q&A) discussions and propose an approach based on a neural language model to learn semantic representations of technical terms and web resources in a joint low-dimensional vector space. Our approach maps each technical term or web resource to the semantic vector space based only on the technical terms and web resources that surround it in a discussion thread, without mining the text content of the discussion. We apply our approach to the Stack Overflow data dump of March 2018. Through quantitative and qualitative analyses of clustering, search, and semantic reasoning tasks, we show that the learnt vector representations capture the semantic relatedness of technical terms and web resources, and that they can support various search and semantic reasoning tasks by means of simple K-nearest neighbor search and simple algebraic operations in the embedding space.
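The K-nearest neighbor search and algebraic operations mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors, the `EMBEDDINGS` table, and the helper names `cosine`, `nearest`, and `analogy` are hypothetical and chosen by hand for illustration. In the paper the vectors are learnt from Stack Overflow discussion threads, and the similarity measure is assumed here to be cosine similarity.

```python
import math

# Hypothetical toy embeddings in a joint space of technical terms and
# web resources. Values are made up by hand, not learnt from any data.
EMBEDDINGS = {
    "java": [0.9, 0.1, 0.0],
    "python": [0.1, 0.9, 0.0],
    "junit": [0.9, 0.1, 0.8],
    "pytest": [0.1, 0.9, 0.8],
    "https://docs.python.org": [0.2, 0.95, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query_vec, k=3, exclude=()):
    """Return the k items whose vectors are most similar to query_vec."""
    scored = [
        (cosine(query_vec, vec), name)
        for name, vec in EMBEDDINGS.items()
        if name not in exclude
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


def analogy(a, b, c, k=1):
    """Answer 'a is to b as c is to ?' using vec(b) - vec(a) + vec(c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    return nearest(target, k=k, exclude={a, b, c})


if __name__ == "__main__":
    # Search: which item lies closest to the term "python"?
    print(nearest(EMBEDDINGS["python"], k=1, exclude={"python"}))
    # Semantic reasoning: "java" is to "junit" as "python" is to ?
    print(analogy("java", "junit", "python"))
```

Because terms and resources share one vector space, the same `nearest` call can return a URL for a term query or a term for a URL query, which is the property the joint embedding is meant to provide.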