CLC number: TP301

On-line Access: 2024-08-27

Received: 2023-10-17

Revision Accepted: 2024-05-08

Crosschecked: 2023-09-25

ORCID: Yulin HE, https://orcid.org/0000-0002-3415-0686

Frontiers of Information Technology & Electronic Engineering  2024 Vol.25 No.9 P.1266-1281

http://doi.org/10.1631/FITEE.2300278


A novel overlapping minimization SMOTE algorithm for imbalanced classification


Author(s):  Yulin HE, Xuan LU, Philippe FOURNIER-VIGER, Joshua Zhexue HUANG

Affiliation(s):  Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China

Corresponding email(s):   yulinhe@gml.ac.cn, 2110276215@email.szu.edu.cn, philfv@szu.edu.cn, zx.huang@szu.edu.cn

Key Words:  Imbalanced classification, Synthetic minority oversampling technique (SMOTE), Majority-class sample point, Minority-class sample point, Generalization capability, Overlapping minimization


Yulin HE, Xuan LU, Philippe FOURNIER-VIGER, Joshua Zhexue HUANG. A novel overlapping minimization SMOTE algorithm for imbalanced classification[J]. Frontiers of Information Technology & Electronic Engineering, 2024, 25(9): 1266-1281.

@article{He2024OMSMOTE,
title="A novel overlapping minimization SMOTE algorithm for imbalanced classification",
author="Yulin HE and Xuan LU and Philippe FOURNIER-VIGER and Joshua Zhexue HUANG",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="25",
number="9",
pages="1266--1281",
year="2024",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2300278"
}

%0 Journal Article
%T A novel overlapping minimization SMOTE algorithm for imbalanced classification
%A Yulin HE
%A Xuan LU
%A Philippe FOURNIER-VIGER
%A Joshua Zhexue HUANG
%J Frontiers of Information Technology & Electronic Engineering
%V 25
%N 9
%P 1266-1281
%@ 2095-9184
%D 2024
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2300278

TY  - JOUR
T1  - A novel overlapping minimization SMOTE algorithm for imbalanced classification
A1  - Yulin HE
A1  - Xuan LU
A1  - Philippe FOURNIER-VIGER
A1  - Joshua Zhexue HUANG
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 25
IS  - 9
SP  - 1266
EP  - 1281
SN  - 2095-9184
Y1  - 2024
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2300278
ER  -


Abstract: 
The synthetic minority oversampling technique (SMOTE) is a popular algorithm for reducing the impact of class imbalance when building classifiers, and it has received several enhancements over the past 20 years. SMOTE and its variants synthesize minority-class sample points in the original sample space to alleviate the adverse effects of class imbalance. This approach works well in many cases, but problems arise when synthetic sample points are generated in overlapping areas between classes, which further complicates classifier training. To address this issue, this paper proposes a novel minority-class sample point generation algorithm that is generalization-oriented rather than imputation-oriented, named overlapping minimization SMOTE (OM-SMOTE). The algorithm is designed specifically for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a new sample space by balancing sample encoding and classifier generalization. It then applies a set of sophisticated minority-class sample point imputation rules to generate synthetic sample points that are as far as possible from overlapping areas between classes. Extensive experiments were conducted on 32 imbalanced datasets to validate the effectiveness of OM-SMOTE. The results show that, for the naive Bayes, support vector machine, decision tree, and logistic regression classifiers, training on synthetic minority-class sample points generated by OM-SMOTE yields better performance than training with 11 state-of-the-art SMOTE-based imputation algorithms. This demonstrates that OM-SMOTE is a viable approach for supporting the training of high-quality classifiers for imbalanced classification. The implementation of OM-SMOTE is publicly available on GitHub at https://github.com/luxuan123123/OM-SMOTE/.
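
For orientation, the baseline that the abstract builds on can be sketched in a few lines. The block below is a minimal illustration of classic SMOTE interpolation (Chawla et al., 2002), in which each synthetic point is x_new = x_i + lam * (x_nn - x_i) for a minority point x_i, one of its nearest minority-class neighbours x_nn, and a random lam in [0, 1). It is not OM-SMOTE and is not taken from the authors' repository; the function name `smote`, its parameters, and the toy data are assumptions made for this sketch.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Classic SMOTE interpolation (illustrative sketch, not OM-SMOTE).

    minority: list of equal-length feature tuples from the minority class.
    Returns n_synthetic new points, each lying on the segment between a
    minority point and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority-class points")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Interpolate: x_new = x_i + lam * (x_nn - x_i), with lam in [0, 1).
        lam = rng.random()
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: four 2-D minority-class points.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority_points, n_synthetic=6, k=2)
```

Because this baseline only looks at minority-class neighbours, nothing in it prevents a synthetic point from landing where majority-class points also lie. That is the overlap problem the abstract describes, which OM-SMOTE is reported to address by first mapping the data into a new sample space.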

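A novel overlapping minimization SMOTE algorithm for imbalanced learning classification

Yulin HE1,2, Xuan LU2, Philippe FOURNIER-VIGER2, Joshua Zhexue HUANG1,2
1Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China
2College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
Abstract: The synthetic minority oversampling technique (SMOTE) is one of the classic algorithms in imbalanced learning and is used to reduce the impact of class imbalance on classifier construction. Over the past 20 years, more than a hundred SMOTE-based variants have been proposed. SMOTE and its variants balance a dataset by imputing minority-class samples in the original sample space, thereby alleviating the adverse effects of class imbalance. This approach performs well in many cases, but when synthetic samples fall into the overlapping regions between classes, classifier training becomes more complex, which in turn harms the generalization capability of the classifier. To solve this problem, this paper proposes an overlapping-minimization-based minority-class sample generation algorithm (Overlapping Minimization SMOTE, OM-SMOTE) for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a more linearly separable sample space by balancing the trade-off between sample encoding and classifier generalization. OM-SMOTE then applies a series of sophisticated minority-class sample point imputation rules so that synthetic samples lie as far as possible from the regions where classes overlap. Extensive experiments on 32 real imbalanced datasets validate the effectiveness of OM-SMOTE. The experimental results show that, compared with 11 other advanced SMOTE-based oversampling algorithms, the minority-class sample points generated by OM-SMOTE significantly improve the performance of classifiers such as naive Bayes, support vector machine, decision tree, and logistic regression. This demonstrates the feasibility of OM-SMOTE for supporting the training of high-quality imbalanced classifiers. The implementation of OM-SMOTE is publicly shared on GitHub (https://github.com/luxuan123123/OM-SMOTE/).

Key words: Imbalanced classification; Synthetic minority oversampling technique; Majority-class sample; Minority-class sample; Generalization capability; Overlapping minimization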


Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2024 Journal of Zhejiang University-SCIENCE