Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2022 Vol.23 No.8 P.1205-1216
Fast code recommendation via approximate sub-tree matching
Abstract: Software developers often write code whose functionality is similar to that of existing code segments. A code recommendation tool that helps developers reuse these code fragments can significantly improve their efficiency. Several methods have been proposed in recent years. Some use sequence matching algorithms to find related recommendations; most of these methods are time-consuming and can leverage only low-level textual information from code. Others extract features from code and compute similarity over numerical feature vectors. However, similarity between feature vectors is often not equivalent to similarity between the original code fragments, because structural information is lost when abstract syntax trees are transformed into vectors. To solve this problem, we propose a method based on approximate sub-tree matching. Unlike existing tree-based approaches that match feature vectors, it retains the tree structure of the query code during matching in order to find the code fragments that best match the current query. It uses a fast approximate sub-tree matching algorithm that transforms the sub-tree matching problem into matching between a tree and a list. In this way, structural information can be used in code recommendation tasks that have strict time requirements. We have constructed several real-world code databases covering different languages and granularities to evaluate the effectiveness of our method. The results show that our method outperforms the two compared methods, SENSORY and Aroma, in terms of recall on all the datasets, and that it can be applied to large datasets.
Key words: Code reuse; Code recommendation; Tree similarity; Structure information
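1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211100, China
2 Key Laboratory of Safety-Critical Software, Ministry of Industry and Information Technology, Nanjing 211100, China
3 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210016, China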
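Abstract (translated from the Chinese): Software developers often need to write code whose functionality is similar to that of existing code, and code recommendation tools that help developers reuse these code fragments can significantly improve software development efficiency. In recent years many researchers have turned their attention to this area and proposed a variety of code recommendation methods. Some researchers use sequence matching algorithms to obtain related code; these methods tend to be inefficient and can exploit only the textual information in code. Others extract features from code to form feature vectors, then compute the similarity between code fragments to obtain recommendations. However, similar feature vectors often do not imply similar original code, and structural information is lost when abstract syntax trees are converted into vectors. To address this, we propose a code recommendation method based on approximate sub-tree matching. Unlike existing methods that match feature vectors, this method preserves the tree structure of the query code during matching, and thus finds the code fragments that are structurally most similar to the current query. In addition, it uses the idea of hashing to transform the sub-tree matching problem into matching between a tree and a list, so that abstract syntax tree information can be used in code recommendation tasks with strict time requirements. To evaluate the effectiveness of the method, we constructed several code datasets covering different languages and granularities. The experimental results show that the method achieves higher recall than the two compared methods, SENSORY and Aroma, on all datasets, and can be applied to large datasets.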
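The abstract describes turning sub-tree matching into a comparison between a tree and a list. As a rough illustration only, and not the authors' algorithm, the sketch below flattens every sub-tree of an abstract syntax tree into a structural signature and scores a candidate by how many of the query's sub-tree signatures appear in the candidate's list. Everything here is an assumption made for the example: the use of Python's standard `ast` module, the signature format (node type names only, identifiers ignored), the overlap score, and the helper names `signature_list`, `score`, `build_index`, and `recommend`.

```python
import ast
from collections import Counter


def _collect(node, out):
    """Append the signature of every sub-tree under `node` to `out`.

    A signature is the node's type name followed by its children's
    signatures, so identifiers and literal values are ignored.
    Returns the signature of `node` itself.
    """
    child_sigs = [_collect(child, out) for child in ast.iter_child_nodes(node)]
    sig = type(node).__name__ + "(" + ",".join(child_sigs) + ")"
    out.append(sig)
    return sig


def signature_list(source):
    """Flatten the AST of `source` into a multiset of sub-tree signatures."""
    out = []
    _collect(ast.parse(source), out)
    return Counter(out)


def score(query_sigs, candidate_sigs):
    """Fraction of the query's sub-trees that also occur in the candidate."""
    overlap = sum((query_sigs & candidate_sigs).values())
    return overlap / sum(query_sigs.values())


def build_index(corpus):
    """Precompute the signature list for each snippet (name -> source)."""
    return {name: signature_list(src) for name, src in corpus.items()}


def recommend(query_source, index, k=1):
    """Return the names of the k snippets that best match the query."""
    query_sigs = signature_list(query_source)
    ranked = sorted(index, key=lambda name: score(query_sigs, index[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical two-snippet database and a query with different identifiers.
corpus = {
    "sum_loop": "total = 0\nfor x in xs:\n    total += x\n",
    "read_file": "with open(p) as f:\n    data = f.read()\n",
}
index = build_index(corpus)
query = "acc = 0\nfor v in items:\n    acc += v\n"
best = recommend(query, index)
```

Because the signatures are computed once per database entry, a query only has to be compared against precomputed lists, which is the kind of saving the tree-to-list transformation aims for. A real system would likely need weighting by sub-tree size and tolerance for small structural differences, which this sketch does not attempt.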
DOI: 10.1631/FITEE.2100379
CLC number: TP311
On-line Access: 2022-08-22
Received: 2021-08-07
Revision Accepted: 2022-03-24
Crosschecked: 2022-08-29