Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2024 Vol.25 No.9 P.1266-1281
A novel overlapping minimization SMOTE algorithm for imbalanced classification
Abstract: The synthetic minority oversampling technique (SMOTE) is a popular algorithm for reducing the impact of class imbalance when building classifiers, and it has received several enhancements over the past 20 years. SMOTE and its variants synthesize minority-class sample points in the original sample space to alleviate the adverse effects of class imbalance. This approach works well in many cases, but problems arise when synthetic sample points are generated in overlapping areas between classes, which further complicates classifier training. To address this issue, this paper proposes a novel minority-class sample point generation algorithm that is generalization-oriented rather than imputation-oriented, named overlapping minimization SMOTE (OM-SMOTE). The algorithm is designed specifically for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a new sample space by balancing sample encoding and classifier generalization. It then employs a set of sophisticated minority-class sample point imputation rules to generate synthetic sample points that are as far as possible from the overlapping areas between classes. Extensive experiments were conducted on 32 imbalanced datasets to validate the effectiveness of OM-SMOTE. The results show that generating synthetic minority-class sample points with OM-SMOTE leads to better training performance for the naive Bayes, support vector machine, decision tree, and logistic regression classifiers than 11 state-of-the-art SMOTE-based imputation algorithms. This demonstrates that OM-SMOTE is a viable approach for supporting the training of high-quality classifiers for imbalanced classification. The implementation of OM-SMOTE is shared publicly on GitHub at https://github.com/luxuan123123/OM-SMOTE/.
Key words: Imbalanced classification; Synthetic minority oversampling technique (SMOTE); Majority-class sample point; Minority-class sample point; Generalization capability; Overlapping minimization
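For readers unfamiliar with the baseline, the following is a minimal sketch of the classic SMOTE interpolation step that the abstract refers to. It is not OM-SMOTE: the abstract does not specify the space mapping or the imputation rules, so neither is attempted here, and the authors' repository should be consulted for the actual method. The function name `smote_sketch` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Classic SMOTE-style interpolation (illustrative, not OM-SMOTE).

    minority: list of minority-class points, each a list of floats.
    n_new:    number of synthetic points to generate.
    k:        number of nearest minority neighbours to consider.
    """
    rng = random.Random(seed)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority-class points")
    k = min(k, n - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # For each minority point, find the indices of its k nearest
    # minority-class neighbours (excluding the point itself).
    neighbours = []
    for i, p in enumerate(minority):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sq_dist(p, minority[j]),
        )
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(n)          # random base point
        j = rng.choice(neighbours[i])  # one of its k nearest neighbours
        gap = rng.random()             # interpolation factor in [0, 1)
        base, nb = minority[i], minority[j]
        # The new point lies on the segment between base and neighbour.
        synthetic.append([a + gap * (b - a) for a, b in zip(base, nb)])
    return synthetic


minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_new=10, k=2, seed=1)
```

Because each synthetic point is placed on a segment between two minority points without regard to where the majority class lies, it can fall inside a region shared by both classes. That is the overlap problem the abstract says OM-SMOTE is designed to minimize.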
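1Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen 518107, China
2College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China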
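Chinese abstract (English translation): The synthetic minority oversampling technique (SMOTE) is one of the classic algorithms in imbalanced learning and is used to reduce the impact of class imbalance on classifier construction. Over the past 20 years, hundreds of SMOTE-based variants have been proposed. SMOTE and its variants balance the dataset by imputing minority-class samples in the original sample space, thereby alleviating the adverse effects of class imbalance. This approach performs well in many cases, but when synthetic samples fall into overlapping areas between classes, classifier training becomes more complex, which in turn harms the generalization capability of the classifier. To address this problem, this paper proposes a minority-class sample generation algorithm based on overlapping minimization (Overlapping Minimization SMOTE, OM-SMOTE) for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a more linearly separable sample space by balancing the trade-off between sample encoding and classifier generalization. OM-SMOTE then applies a series of sophisticated minority-class sample point imputation rules so that the synthetic samples lie as far as possible from the overlapping areas between classes. Extensive experiments on 32 real imbalanced datasets validate the effectiveness of OM-SMOTE. The experimental results show that, compared with 11 other advanced SMOTE-based oversampling algorithms, the minority-class sample points generated by OM-SMOTE significantly improve the performance of the naive Bayes, support vector machine, decision tree, and logistic regression classifiers. This demonstrates the feasibility of using OM-SMOTE to support the training of high-quality classifiers for imbalanced data. The implementation of OM-SMOTE is shared publicly on GitHub (https://github.com/luxuan123123/OM-SMOTE/).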
DOI: 10.1631/FITEE.2300278
CLC number: TP301
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-09-25