CLC number: TP301
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-09-25
Yulin HE, Xuan LU, Philippe FOURNIER-VIGER, Joshua Zhexue HUANG. A novel overlapping minimization SMOTE algorithm for imbalanced classification[J]. Frontiers of Information Technology & Electronic Engineering, 2024, 25(9): 1266-1281.
@article{title="A novel overlapping minimization SMOTE algorithm for imbalanced classification",
author="Yulin HE, Xuan LU, Philippe FOURNIER-VIGER, Joshua Zhexue HUANG",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="25",
number="9",
pages="1266-1281",
year="2024",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2300278"
}
%0 Journal Article
%T A novel overlapping minimization SMOTE algorithm for imbalanced classification
%A Yulin HE
%A Xuan LU
%A Philippe FOURNIER-VIGER
%A Joshua Zhexue HUANG
%J Frontiers of Information Technology & Electronic Engineering
%V 25
%N 9
%P 1266-1281
%@ 2095-9184
%D 2024
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2300278
TY - JOUR
T1 - A novel overlapping minimization SMOTE algorithm for imbalanced classification
A1 - Yulin HE
A1 - Xuan LU
A1 - Philippe FOURNIER-VIGER
A1 - Joshua Zhexue HUANG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 25
IS - 9
SP - 1266
EP - 1281
SN - 2095-9184
Y1 - 2024
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2300278
ER -
Abstract: The synthetic minority oversampling technique (SMOTE) is a popular algorithm for reducing the impact of class imbalance when building classifiers, and it has received many enhancements over the past 20 years. SMOTE and its variants synthesize minority-class sample points in the original sample space to alleviate the adverse effects of class imbalance. This approach works well in many cases, but problems arise when synthetic sample points are generated in overlapping areas between different classes, which further complicates classifier training. To address this issue, this paper proposes a novel minority-class sample point generation algorithm that is generalization-oriented rather than imputation-oriented, named overlapping minimization SMOTE (OM-SMOTE). The algorithm is designed specifically for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a new sample space by balancing sample encoding and classifier generalization. It then employs a set of sophisticated minority-class sample point imputation rules to generate synthetic sample points that lie as far as possible from the overlapping areas between classes. Extensive experiments were conducted on 32 imbalanced datasets to validate the effectiveness of OM-SMOTE. The results show that using OM-SMOTE to generate synthetic minority-class sample points leads to better training performance for the naive Bayes, support vector machine, decision tree, and logistic regression classifiers than 11 state-of-the-art SMOTE-based imputation algorithms. This demonstrates that OM-SMOTE is a viable approach for training high-quality classifiers for imbalanced classification. The implementation of OM-SMOTE is publicly available on GitHub at https://github.com/luxuan123123/OM-SMOTE/.
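For readers unfamiliar with the family of oversampling methods that OM-SMOTE extends, the sketch below illustrates the interpolation step of the original SMOTE algorithm (Chawla et al., 2002): each synthetic point is drawn on the line segment between a minority-class sample and one of its k nearest minority-class neighbors. This is a minimal sketch of baseline SMOTE only, not of OM-SMOTE's sample-space mapping or imputation rules; the function name smote_oversample and its parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_minority, n_synthetic, k=5, random_state=0):
    """Generate n_synthetic points by interpolating between each selected
    minority-class sample and one of its k nearest minority-class neighbors."""
    X_minority = np.asarray(X_minority, dtype=float)
    rng = np.random.default_rng(random_state)

    # Fit k+1 neighbors so that each point's own index (column 0) can be dropped.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbor_idx = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))      # pick a minority-class seed point
        j = rng.choice(neighbor_idx[i])        # pick one of its minority-class neighbors
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.asarray(synthetic)
```

For example, given a minority-class matrix X_min with 50 rows, smote_oversample(X_min, n_synthetic=150) would return 150 interpolated points that can be appended to the training set before fitting a classifier. OM-SMOTE departs from this scheme by first re-encoding the sample points and then constraining where synthetic points may be placed, so that they avoid the overlapping regions between classes.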