Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2022 Vol.23 No.2 P.278-290
One-against-all-based Hellinger distance decision tree for multiclass imbalanced learning
Abstract: Traditional machine learning methods are sensitive to skewed class distributions and do not account for the characteristics of multiclass imbalance problems, so multiclass data with a skewed distribution pose a major challenge to them. To tackle this issue, we propose a new decision tree splitting criterion, the one-against-all-based Hellinger distance (OAHD). OAHD has two crucial elements. First, the one-against-all scheme is integrated into the computation of the Hellinger distance, thereby extending the Hellinger distance decision tree to the multiclass imbalance problem. Second, a modified Gini index is designed that takes into account the distribution and the number of distinct classes. We also give theoretical proofs of the properties of OAHD, including skew insensitivity and the ability to seek a purer node in the decision tree. Finally, we collect 20 public real-world imbalanced data sets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository and the University of California, Irvine (UCI) repository. Experimental results show that OAHD significantly improves performance over five other well-known decision trees in terms of Precision, F-measure, and multiclass area under the receiver operating characteristic curve (MAUC). Statistical analysis with the Friedman and Nemenyi tests supports the advantage of OAHD over the five other decision trees.
Key words: Decision trees; Multiclass imbalanced learning; Node splitting criterion; Hellinger distance; One-against-all scheme
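To make the one-against-all idea concrete, the following is a minimal sketch, not the authors' OAHD criterion. The abstract does not give the exact formula, so the sketch assumes the standard two-class Hellinger distance used in Hellinger distance decision trees, applies it to each class against all remaining classes, and averages the results. The averaging step and the function names hellinger_binary and one_against_all_hellinger are illustrative assumptions, and the modified Gini index mentioned in the abstract is not covered.

```python
import math
from collections import Counter


def hellinger_binary(branches, labels, positive):
    """Two-class Hellinger distance of a candidate split.

    `branches[i]` is the branch that sample i falls into and `labels[i]`
    is its class. Class `positive` is compared against all other classes.
    """
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.0  # only one side is present, so there is nothing to separate

    pos_counts = Counter(b for b, y in zip(branches, labels) if y == positive)
    neg_counts = Counter(b for b, y in zip(branches, labels) if y != positive)

    total = 0.0
    for b in set(branches):
        # Compare the fraction of each side that lands in branch b.
        diff = math.sqrt(pos_counts[b] / n_pos) - math.sqrt(neg_counts[b] / n_neg)
        total += diff ** 2
    return math.sqrt(total)


def one_against_all_hellinger(branches, labels):
    """Average of the one-against-all Hellinger distances over all classes.

    Averaging is an assumption made for this sketch; the paper may
    aggregate the per-class distances differently.
    """
    classes = sorted(set(labels))
    return sum(hellinger_binary(branches, labels, c) for c in classes) / len(classes)


# Toy three-class split in which every class falls into its own branch.
score = one_against_all_hellinger(
    ["L", "L", "M", "M", "R", "R"],
    ["a", "a", "b", "b", "c", "c"],
)
```

Because each term uses within-class fractions rather than raw counts, the two-class distance does not change when one class is replicated, which is the intuition behind skew insensitivity in the binary setting. A larger score indicates a split that separates the classes better.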
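1 College of Information Science and Engineering, Guilin University of Technology, Guilin 541004, China
2 Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin 541004, China
3 Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China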
DOI: 10.1631/FITEE.2000417
CLC number: TP301
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2021-02-14