Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

A new feature selection method for handling redundant information in text classification

Abstract: Feature selection is an important approach to dimensionality reduction in the field of text classification. Because the selected features often contain redundant information that is difficult to handle, we propose a new, simple feature selection method that can effectively filter out redundant features. First, to quantify the relationship between two words, the definitions of word-frequency-based relevance and correlative redundancy are introduced. Next, an optimal feature selection (OFS) method is chosen to obtain a feature subset FS1. Finally, to improve execution speed, the redundant features in FS1 are filtered using a predetermined threshold, and the filtered features are stored in linked lists. Experiments are carried out on three datasets (WebKB, 20-Newsgroups, and Reuters-21578), with support vector machines and naïve Bayes used as classifiers. The results show that the classification accuracy of the proposed method is generally higher than that of typical traditional methods (information gain, improved Gini index, and improved comprehensively measured feature selection) and of the OFS methods. Moreover, the proposed method runs faster than typical mutual-information-based methods (improved and normalized mutual-information-based feature selections, and multilabel feature selection based on maximum dependency and minimum redundancy) while maintaining classification accuracy. Statistical results validate the effectiveness of the proposed method in handling redundant information in text classification.
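The filtering step outlined in the abstract (rank features with an OFS method, then drop any feature too redundant with an already-kept one) can be sketched as follows. This is a hypothetical illustration only: the paper's exact word-frequency relevance and correlative-redundancy definitions are not given on this page, so a simple document-co-occurrence ratio stands in for them, and a Python list plays the role of the linked lists mentioned in the abstract.

```python
def redundancy(docs_a, docs_b):
    """Stand-in redundancy score: fraction of shared documents relative
    to the rarer of the two words (1.0 = the words fully co-occur)."""
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / min(len(docs_a), len(docs_b))

def filter_redundant(fs1, doc_sets, threshold=0.8):
    """fs1: candidate features, ranked best-first by an OFS method.
    doc_sets: word -> set of ids of documents containing that word.
    Keeps a feature only if its redundancy with every feature kept so
    far stays below the predetermined threshold (single pass, so each
    feature is compared only against the already-accepted subset)."""
    kept = []  # the paper stores the filtered features in linked lists
    for f in fs1:
        if all(redundancy(doc_sets[f], doc_sets[g]) < threshold for g in kept):
            kept.append(f)
    return kept
```

For example, with doc_sets = {'price': {1, 2, 3}, 'cost': {1, 2, 3}, 'game': {4, 5}}, calling filter_redundant(['price', 'cost', 'game'], doc_sets) drops 'cost' (it fully co-occurs with the higher-ranked 'price') and returns ['price', 'game'].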

Key words: Feature selection; Dimensionality reduction; Text classification; Redundant features; Support vector machine; Naïve Bayes; Mutual information

Chinese Summary (translated): A new redundancy-removing feature selection method for text classification

Abstract (translated): Feature selection is an important dimensionality-reduction method in the field of text classification. To address the problem that the feature sets selected by traditional methods often contain redundant information, we propose a new feature selection method that can effectively remove redundant information. First, to measure the relationship between two words, the concepts of word-frequency-based relevance and the relatively redundant word set are introduced. Next, an optimal feature selection method is chosen and used to obtain a temporary feature subset. Finally, to improve execution efficiency, redundant features in the temporary subset are removed using a predetermined threshold, and the results are stored in a linked-list structure. The experiments use support vector machines and naïve Bayes as classifiers, with WebKB, 20-Newsgroups, and Reuters-21578 as test datasets. The results show that the classification accuracy of the proposed method is higher than that of traditional feature selection methods, and that, compared with mutual-information-based methods, it effectively improves running efficiency while maintaining classification accuracy.

Key terms (translated): feature selection; dimensionality reduction; text classification; redundant features; support vector machine; naïve Bayes; mutual information



DOI: 10.1631/FITEE.1601761

CLC number: TP391


On-line Access: 2018-04-09

Received: 2016-11-30

Revision Accepted: 2017-02-05

Crosschecked: 2018-02-08

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000-present, Journal of Zhejiang University-SCIENCE