CLC number: TP301
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-09-25
Yulin HE, Xuan LU, Philippe FOURNIER-VIGER, Joshua Zhexue HUANG. A novel overlapping minimization SMOTE algorithm for imbalanced classification[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2300278
1 Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen 518107, China
2 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China

Abstract: The synthetic minority oversampling technique (SMOTE) is one of the classical algorithms in imbalanced learning; it is used to mitigate the impact of class imbalance on classifier construction. Over the past 20 years, more than one hundred SMOTE-based variants have been proposed. SMOTE and its variants balance a dataset by interpolating new minority-class samples in the original sample space, thereby reducing the adverse effects of class imbalance. This approach works well in many cases, but when the synthetic samples fall into the overlapping region between classes, the complexity of classifier training increases and the classifier's generalization ability suffers. To address this problem, this paper proposes an overlapping minimization SMOTE (OM-SMOTE) algorithm for generating minority-class samples in binary imbalanced classification. OM-SMOTE first maps the original samples into a more linearly separable sample space by balancing the trade-off between sample encoding and classifier generalization. It then applies a set of carefully designed interpolation rules for minority-class samples so that the synthetic samples are placed as far as possible from the class-overlapping region. Extensive experiments on 32 real-world imbalanced datasets validate the effectiveness of OM-SMOTE. The results show that, compared with 11 state-of-the-art SMOTE-based oversampling algorithms, the minority-class samples generated by OM-SMOTE significantly improve the performance of naive Bayes, support vector machine, decision tree, and logistic regression classifiers, demonstrating that OM-SMOTE supports the training of high-quality imbalanced classifiers. The implementation of OM-SMOTE is publicly available on GitHub (https://github.com/luxuan123123/OM-SMOTE/).
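The abstract describes the core SMOTE mechanism that OM-SMOTE refines: synthetic minority-class points are generated by linear interpolation between a minority sample and one of its nearest minority neighbors, which is precisely the step that can place new points inside the class-overlapping region. For reference, below is a minimal sketch of that standard interpolation step, not of OM-SMOTE's overlap-aware rules (whose details are not given on this page); the function name, parameters, and toy data are illustrative assumptions.

```python
import numpy as np

def smote_interpolate(X_min, n_synthetic, k=5, seed=None):
    """Core SMOTE step: interpolate between a randomly chosen minority sample and
    one of its k nearest minority neighbors. OM-SMOTE replaces this plain rule with
    interpolation rules that keep synthetic points away from the overlap region."""
    rng = np.random.default_rng(seed)
    n, _ = X_min.shape
    # Pairwise distances among minority samples only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                        # exclude self-matches
    neighbors = np.argsort(dists, axis=1)[:, :k]           # k nearest minority neighbors
    base = rng.integers(0, n, size=n_synthetic)            # randomly chosen seed samples
    picks = neighbors[base, rng.integers(0, k, size=n_synthetic)]
    gaps = rng.random((n_synthetic, 1))                    # interpolation coefficients in [0, 1)
    return X_min[base] + gaps * (X_min[picks] - X_min[base])

# Toy usage: oversample a minority class of 20 two-dimensional points.
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_new = smote_interpolate(X_min, n_synthetic=30, k=5, seed=0)
print(X_new.shape)  # (30, 2)
```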