CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2018-02-08
You-wei Wang, Li-zhou Feng. A new feature selection method for handling redundant information in text classification[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(2): 221-234.
@article{title="A new feature selection method for handling redundant information in text classification",
author="You-wei Wang, Li-zhou Feng",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="19",
number="2",
pages="221-234",
year="2018",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601761"
}
%0 Journal Article
%T A new feature selection method for handling redundant information in text classification
%A You-wei Wang
%A Li-zhou Feng
%J Frontiers of Information Technology & Electronic Engineering
%V 19
%N 2
%P 221-234
%@ 2095-9184
%D 2018
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1601761
TY - JOUR
T1 - A new feature selection method for handling redundant information in text classification
A1 - You-wei Wang
A1 - Li-zhou Feng
JO - Frontiers of Information Technology & Electronic Engineering
VL - 19
IS - 2
SP - 221
EP - 234
SN - 2095-9184
Y1 - 2018
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1601761
ER -
Abstract: Feature selection is an important approach to dimensionality reduction in text classification. Because it is difficult to handle the redundant information that selected features always contain, we propose a simple new feature selection method that effectively filters redundant features. First, to quantify the relationship between two words, we introduce definitions of word-frequency-based relevance and correlative redundancy. Second, an optimal feature selection (OFS) method is chosen to obtain a feature subset FS1. Finally, to improve execution speed, the redundant features in FS1 are filtered using a predetermined threshold, and the filtered features are stored in linked lists. Experiments are carried out on three datasets (WebKB, 20-Newsgroups, and Reuters-21578), with support vector machines and naïve Bayes as the classifiers. The results show that the classification accuracy of the proposed method is generally higher than that of typical traditional methods (information gain, improved Gini index, and improved comprehensively measured feature selection) and the OFS methods. Moreover, the proposed method runs faster than typical mutual information-based methods (improved and normalized mutual information-based feature selection, and multilabel feature selection based on maximum dependency and minimum redundancy) while maintaining classification accuracy. Statistical results validate the effectiveness of the proposed method in handling redundant information in text classification.
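The threshold-based filtering step can be illustrated with a minimal sketch. The abstract does not give the definitions of word-frequency-based relevance or correlative redundancy, so the code below substitutes the Pearson correlation between per-document term-frequency vectors as a stand-in redundancy measure. The names `pearson`, `filter_redundant`, `term_freqs`, and `fs1`, as well as the toy data and the threshold value, are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length frequency vectors.

    ASSUMPTION: used here only as a stand-in for the paper's
    "correlative redundancy", whose definition is not in the abstract.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no co-variation
    return cov / sqrt(var_x * var_y)


def filter_redundant(ranked_features, term_freqs, threshold):
    """Greedily keep a feature only if its correlation with every
    already-kept feature stays below the threshold.

    ranked_features: feature names, best first (plays the role of FS1).
    term_freqs: maps each feature to its per-document frequencies.
    """
    kept = []
    for feature in ranked_features:
        if all(pearson(term_freqs[feature], term_freqs[k]) < threshold
               for k in kept):
            kept.append(feature)
    return kept


# Toy corpus of four documents (hypothetical data).
term_freqs = {
    "football": [3, 0, 2, 0],
    "soccer":   [6, 0, 4, 0],  # always co-occurs with "football"
    "stock":    [0, 2, 0, 3],
    "goal":     [1, 1, 0, 0],
}
fs1 = ["football", "soccer", "stock", "goal"]  # assumed ranking from a prior selector
selected = filter_redundant(fs1, term_freqs, threshold=0.9)
print(selected)
```

In this toy run, "soccer" is dropped because its frequencies are an exact multiple of those of the higher-ranked "football", while the other words pass the threshold. The greedy pass compares each candidate only against features already kept, which is one simple way to bound the number of pairwise comparisons; the paper's linked-list bookkeeping is not reproduced here.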