Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

A novel overlapping minimization SMOTE algorithm for imbalanced classification

Abstract: The synthetic minority oversampling technique (SMOTE) is a popular algorithm for reducing the impact of class imbalance when building classifiers, and it has received several enhancements over the past 20 years. SMOTE and its variants synthesize a number of minority-class sample points in the original sample space to alleviate the adverse effects of class imbalance. This approach works well in many cases, but problems arise when synthetic sample points are generated in overlapping areas between different classes, which further complicates classifier training. To address this issue, this paper proposes a novel minority-class sample point generation algorithm that is generalization-oriented rather than imputation-oriented, named overlapping minimization SMOTE (OM-SMOTE). The algorithm is designed specifically for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a new sample space by balancing sample encoding and classifier generalization. Then, OM-SMOTE employs a set of sophisticated minority-class sample point imputation rules to generate synthetic sample points that are as far as possible from overlapping areas between classes. Extensive experiments were conducted on 32 imbalanced datasets to validate the effectiveness of OM-SMOTE. The results show that using OM-SMOTE to generate synthetic minority-class sample points leads to better training performance for the naive Bayes, support vector machine, decision tree, and logistic regression classifiers than 11 state-of-the-art SMOTE-based imputation algorithms. This demonstrates that OM-SMOTE is a viable approach for supporting the training of high-quality classifiers for imbalanced classification. The implementation of OM-SMOTE is shared publicly on the GitHub platform at https://github.com/luxuan123123/OM-SMOTE/.

Key words: Imbalanced classification; Synthetic minority oversampling technique (SMOTE); Majority-class sample point; Minority-class sample point; Generalization capability; Overlapping minimization
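
As background for the abstract, the interpolation step used by classic SMOTE can be sketched as follows. This is a minimal illustration of the baseline technique only, not of OM-SMOTE: the function name smote_sketch, its parameters, and the brute-force neighbor search are assumptions made for this example and are not taken from the paper or its repository.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Classic SMOTE-style interpolation (illustrative; NOT OM-SMOTE).

    X_min : array of shape (n, d) holding minority-class sample points.
    n_new : number of synthetic sample points to generate.
    k     : number of nearest minority-class neighbors to consider.
    """
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority-class sample points")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Brute-force pairwise squared distances among minority points.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    nbrs = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors

    # Pick a random base point and one of its neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    pick = nbrs[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[pick] - X_min[base])


if __name__ == "__main__":
    X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(smote_sketch(X_minority, n_new=3, k=2))
```

Because every synthetic point lies on a segment between two minority-class points, nothing in this baseline prevents it from landing where majority-class points are dense. That is the overlap problem the abstract describes and that OM-SMOTE is proposed to reduce.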

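Chinese Summary: A novel overlapping minimization SMOTE algorithm for imbalanced classification

Yulin HE1,2, Xuan LU2, Philippe FOURNIER-VIGER2, Zhexue HUANG1,2
1Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen 518107, China
2College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
Abstract: The synthetic minority oversampling technique (SMOTE) is one of the classic algorithms in imbalanced learning and is used to reduce the impact of class imbalance on classifier construction. Over the past 20 years, more than a hundred SMOTE-based variants have been proposed. SMOTE and its variants balance the dataset by imputing minority-class samples in the original sample space, thereby alleviating the adverse effects of class imbalance. This approach performs well in many cases, but when synthetic samples fall into overlapping regions between classes, classifier training becomes more complex, which in turn harms the generalization capability of the classifier. To address this problem, this paper proposes an overlapping-minimization-based minority-class sample generation algorithm (Overlapping Minimization SMOTE, OM-SMOTE) for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a more linearly separable sample space by balancing the trade-off between sample encoding and classifier generalization. Then, OM-SMOTE applies a set of sophisticated minority-class sample point imputation rules so that synthetic samples are as far as possible from the regions where classes overlap. Extensive experiments on 32 real imbalanced datasets validate the effectiveness of OM-SMOTE. The experimental results show that, compared with 11 other advanced SMOTE-based oversampling algorithms, the minority-class sample points generated by OM-SMOTE significantly improve the performance of classifiers such as naive Bayes, support vector machine, decision tree, and logistic regression. This demonstrates the feasibility of using OM-SMOTE to support the training of high-quality classifiers for imbalanced data. The implementation of OM-SMOTE is publicly shared on the GitHub platform (https://github.com/luxuan123123/OM-SMOTE/).

Key words: Imbalanced classification; Synthetic minority oversampling technique; Majority-class sample; Minority-class sample; Generalization capability; Overlapping minimization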


DOI: 10.1631/FITEE.2300278

CLC number: TP301


On-line Access: 2024-08-27

Received: 2023-10-17

Revision Accepted: 2024-05-08

Crosschecked: 2023-09-25

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE