JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering 2017 Vol.18 No.11 P.1744-1753

A feature selection approach based on a similarity measure for software defect prediction

Author(s): Qiao Yu, Shu-juan Jiang, Rong-cun Wang, Hong-yang Wang
Affiliation(s): School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
Corresponding email(s): yuqiao@cumt.edu.cn, shjjiang@cumt.edu.cn
Key Words: Software defect prediction, Feature selection, Similarity measure, Feature weights, Feature ranking list


Qiao Yu, Shu-juan Jiang, Rong-cun Wang, Hong-yang Wang. A feature selection approach based on a similarity measure for software defect prediction[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(11): 1744-1753.

@article{title="A feature selection approach based on a similarity measure for software defect prediction",
author="Qiao Yu, Shu-juan Jiang, Rong-cun Wang, Hong-yang Wang",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="18",
number="11",
pages="1744-1753",
year="2017",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601322"
}

%0 Journal Article
%T A feature selection approach based on a similarity measure for software defect prediction
%A Qiao Yu
%A Shu-juan Jiang
%A Rong-cun Wang
%A Hong-yang Wang
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 11
%P 1744-1753
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1601322

TY - JOUR
T1 - A feature selection approach based on a similarity measure for software defect prediction
A1 - Qiao Yu
A1 - Shu-juan Jiang
A1 - Rong-cun Wang
A1 - Hong-yang Wang
JO - Frontiers of Information Technology & Electronic Engineering
VL - 18
IS - 11
SP - 1744
EP - 1753
SN - 2095-9184
Y1 - 2017
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1601322
ER -


Abstract: Software defect prediction aims to find potential defects based on historical data and software features. Software features reflect the characteristics of software modules. However, some of these features may be highly relevant to the class (defective or non-defective), whereas others may be redundant or irrelevant. To fully measure the correlation between different features and the class, we present a feature selection approach based on a similarity measure (SM) for software defect prediction. First, the feature weights are updated according to the similarity of samples in different classes. Second, a feature ranking list is generated by sorting the feature weights in descending order, and all feature subsets are selected from the feature ranking list in sequence. Finally, all feature subsets are evaluated on a k-nearest neighbor (KNN) model and measured by the area under the curve (AUC) metric for classification performance. The experiments are conducted on 11 National Aeronautics and Space Administration (NASA) datasets, and the results show that our approach performs better than or comparably to the compared feature selection approaches in terms of classification performance.
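The three steps in the abstract can be illustrated with a minimal sketch. The abstract does not give the SM weight-update formula, so the sketch substitutes a Relief-style update (nearest same-class and nearest other-class neighbor) as a stand-in; it is not the authors' method. The function names, the Manhattan distance, the toy data, and the leave-one-out KNN scoring are all assumptions made for illustration.

```python
# Sketch only: Relief-style weights stand in for the paper's SM update,
# whose exact formula is not given in the abstract.

def relief_weights(X, y):
    """Reward features that differ across classes and agree within a class.
    Assumes features are already scaled to [0, 1]."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for i in range(n):
        def dist(j):
            return sum(abs(a - b) for a, b in zip(X[i], X[j]))
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit, miss = min(hits, key=dist), min(misses, key=dist)
        for f in range(d):
            w[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return w

def ranking(weights):
    """Feature indices sorted by weight, highest first."""
    return sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)

def nested_subsets(ranked):
    """Top-1, top-2, ..., top-d subsets taken in sequence from the ranking."""
    return [ranked[:m] for m in range(1, len(ranked) + 1)]

def knn_scores(X, y, features, k=3):
    """Leave-one-out KNN score: share of defective samples among the
    k nearest neighbors, using only the given features."""
    scores = []
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        others.sort(key=lambda j: sum(abs(X[i][f] - X[j][f]) for f in features))
        scores.append(sum(y[j] for j in others[:k]) / k)
    return scores

def auc(scores, y):
    """Probability that a random defective sample outscores a random
    non-defective one; ties count as 0.5."""
    pos = [s for s, c in zip(scores, y) if c == 1]
    neg = [s for s, c in zip(scores, y) if c == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.10, 0.7], [0.20, 0.1], [0.15, 0.5], [0.05, 0.3],
     [0.90, 0.8], [0.80, 0.2], [0.85, 0.6], [0.95, 0.4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = defective, 0 = non-defective

ranked = ranking(relief_weights(X, y))
subsets = nested_subsets(ranked)
aucs = [auc(knn_scores(X, y, s), y) for s in subsets]
for s, a in zip(subsets, aucs):
    print(s, round(a, 3))
```

On this toy data the informative feature is ranked first, and the subset with the highest AUC would be the one kept. A real experiment would use proper cross-validation rather than the leave-one-out shortcut assumed here.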

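A similarity-measure-based feature selection approach for software defect prediction

Summary: Software defect prediction aims to discover potential defects using historical data and software features that reflect the characteristics of software modules. However, some features may be highly relevant to the class (defective or non-defective), while others may be redundant or irrelevant. To address the differing relevance of features to the class in software defect prediction, this paper proposes a feature selection approach based on a similarity measure (SM). First, feature weights are updated according to the similarity between samples of different classes. Next, a feature ranking list is generated by sorting the feature weights in descending order, and all feature subsets are selected from the ranking list in sequence. Finally, the classification performance of all feature subsets is validated on a k-nearest neighbor (KNN) model and measured with the area under the curve (AUC) metric. Experiments on 11 National Aeronautics and Space Administration (NASA) datasets show that, compared with four other feature selection approaches, the proposed approach achieves comparable or even better classification performance.

Key words: Software defect prediction; Feature selection; Similarity measure; Feature weights; Feature ranking list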


