JZUS - Journal of Zhejiang University SCIENCE

Journal of Zhejiang University SCIENCE C 2013 Vol.14 No.6 P.407-416

http://doi.org/10.1631/jzus.C1200303

Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization

Author(s): Min Du, Xing-shu Chen
Affiliation(s): School of Computer Science, Sichuan University, Chengdu 610065, China
Corresponding email(s): doingscu@gmail.com, chenxsh@scu.edu.cn
Key Words: k-nearest neighbors (kNN), Text categorization, Accelerating strategy, Principal component analysis (PCA)

Share this article to： More <<< Previous Article \|Next Article >>>

Min Du, Xing-shu Chen. Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization[J]. Journal of Zhejiang University Science C, 2013, 14(6): 407-416.

@article{title="Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization",
author="Min Du, Xing-shu Chen",
journal="Journal of Zhejiang University Science C",
volume="14",
number="6",
pages="407-416",
year="2013",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.C1200303"
}

%0 Journal Article
%T Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization
%A Min Du
%A Xing-shu Chen
%J Journal of Zhejiang University SCIENCE C
%V 14
%N 6
%P 407-416
%@ 1869-1951
%D 2013
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C1200303

TY - JOUR
T1 - Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization
A1 - Min Du
A1 - Xing-shu Chen
J0 - Journal of Zhejiang University Science C
VL - 14
IS - 6
SP - 407
EP - 416
%@ 1869-1951
Y1 - 2013
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/jzus.C1200303

Abstract
Chinese Summary
Academic Network
Reviewer Comment

Abstract: text categorization is a significant technique to manage the surging text data on the Internet. The k-nearest neighbors (kNN) algorithm is an effective, but not efficient, classification model for text categorization. In this paper, we propose an effective strategy to accelerate the standard kNN, based on a simple principle: usually, near points in space are also near when they are projected into a direction, which means that distant points in the projection direction are also distant in the original space. Using the proposed strategy, most of the irrelevant points can be removed when searching for the k-nearest neighbors of a query point, which greatly decreases the computation cost. Experimental results show that the proposed strategy greatly improves the time performance of the standard kNN, with little degradation in accuracy. Specifically, it is superior in applications that have large and high-dimensional datasets.

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference

[1]Aghbari, Z., 2005. Array-index: a plug&search K nearest neighbors method for high-dimensional data. Data Knowl. Eng., 52(3):333-352.

[2]Chen, E.H., Lin, Y.G., Xiong, H., Luo, Q.M., Ma, H.P., 2011. Exploiting probabilistic topic models to improve text categorization under class imbalance. Inf. Process. Manag., 47(2):202-214.

[3]Coussement, K., van den Poel, D., 2008. Integrating the voice of customers through call center emails into a decision support system for churn prediction. Inf. Manag., 45(3):164-174.

[4]de Souza, A.F., Pedroni, F., Oliveira, E., Ciarelli, P.M., Henrique, W.F., Veronese, L., Badue, C., 2009. Automated multi-label text categorization with VG-RAM weightless neural networks. Neurocomputing, 72(10-12):2209-2217.

[5]He, J., Tan, A.H., Tan, C.L., 2003. On machine learning methods for Chinese document categorization. Appl. Intell., 18(3):311-322.

[6]Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R., 2005. iDistance: an adaptive B⁺-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst., 30(2):364-397.

[7]Jiang, J.Y., Tsai, S.C., Lee, S.J., 2012. FSKNN: multi-label text categorization based on fuzzy similarity and k nearest neighbors. Expert Syst. Appl., 39(3):2813-2821.

[8]Jiang, S.Y., Pang, G.S., Wu, M.L., Kuang, L.M., 2012. An improved K-nearest-neighbor algorithm for text categorization. Expert Syst. Appl., 39(1):1503-1509.

[9]Lee, L.H., Isa, D., Choo, W.O., Chue, W.Y., 2012. High relevance keyword extraction facility for Bayesian text classification on different domains of varying characteristic. Expert Syst. Appl., 39(1):1147-1155.

[10]Liu, B., 2011. Web Data Mining (2nd Ed.). Springer, Berlin, Heidelberg, p.217.

[11]Miao, Y.Q., Kamel, M., 2011. Pairwise optimized Rocchio algorithm for text categorization. Pattern Recogn. Lett., 32(2):375-382.

[12]Qi, X.G., Davison, B.D., 2009. Web page classification: features and algorithms. ACM Comput. Surv., 41(2):12-42.

[13]Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y., Wang, Z., 2007. A novel feature selection algorithm for text categorization. Expert Syst. Appl., 33(1):1-5.

[14]Wang, B.K., Huang, Y.F., Yang, W.X., Li, X., 2012. Short text classification based on strong feature thesaurus. J. Zhejiang Univ.-Sci. C (Comput. & Electron.), 13(9):649-659.

[15]Wang, S.G., Li, D.Y., Song, X.L., Wei, Y.J., Li, H.X., 2011. A feature selection method based on improved Fisher’s discriminant ratio for text sentiment classification. Expert Syst. Appl., 38(7):8696-8702.

[16]Wang, Y., Wang, Z.O., 2007. A Fast KNN Algorithm for Text Categorization. Proc. 6th Int. Conf. on Machine Learning and Cybernetics, p.3436-3441.

[17]Zhang, X., Huang, H., Zhang, K., 2009. KNN Text Categorization Algorithm Based on Semantic Centre. Proc. Int. Conf. on Information Technology and Computer Science, p.249-252.

[18]Zhou, B., Yao, Y.Y., Luo, J., 2010. A Three-Way Decision Approach to Email Spam Filtering. Proc. 23rd Canadian Conf. on Artificial Intelligence, p.28-39.

Open peer comments: Debate/Discuss/Question/Opinion

<1>

Similar articles

- Go to

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference