Two-step ensemble under-sampling algorithm for massive imbalanced data classification.

Authors :
Bai, Lin
Ju, Tong
Wang, Hao
Lei, Mingzhu
Pan, Xiaoying
Source :
Information Sciences. Apr 2024, Vol. 665.
Publication Year :
2024

Abstract

Imbalanced data classification is a challenging problem in machine learning. Class imbalance, class overlap, and large data volume all significantly affect classification performance. Focusing on the impact of class overlap on classification effectiveness, we propose a two-step ensemble under-sampling algorithm based on boundary information mining (TSSE-BIM), with the goal of reducing the information loss that under-sampling methods incur on large-scale imbalanced data. In the first step, the proposed method applies an improved equalization under-sampling strategy to mine sample contribution information and quickly obtain the distribution of the data relative to the decision boundary. In the second step, weighted boundary sampling based on this boundary information removes noisy and highly overlapping samples. This makes it easy to retain high-contribution samples and effectively suppresses the information loss caused by under-sampling. The overall framework is a serial ensemble similar to boosting, in which each base classifier is weighted according to its false positive rate and false negative rate on the original data to improve overall performance. Extensive experiments indicate that TSSE-BIM outperforms state-of-the-art methods and ranks first on average under four metrics, especially F1 and MCC. [ABSTRACT FROM AUTHOR]
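
The abstract singles out F1 and MCC among its evaluation metrics. As a minimal sketch of why these two are informative on imbalanced data, the following computes both from a binary confusion matrix using their standard definitions. This is not code from the paper; the function names and the confusion-matrix counts are illustrative assumptions.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the ratio is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical test set: 1000 samples, 50 of them in the minority (positive) class.
# Case A: a classifier that finds most minority samples at the cost of some false alarms.
case_a = dict(tp=40, fn=10, fp=30, tn=920)
# Case B: a trivial classifier that always predicts the majority class.
case_b = dict(tp=0, fn=50, fp=0, tn=950)

for name, c in (("A", case_a), ("B", case_b)):
    print(
        name,
        "accuracy=%.3f" % accuracy(c["tp"], c["tn"], c["fp"], c["fn"]),
        "F1=%.3f" % f1_score(c["tp"], c["fp"], c["fn"]),
        "MCC=%.3f" % mcc(c["tp"], c["tn"], c["fp"], c["fn"]),
    )
```

In this made-up example the two classifiers have nearly the same accuracy, yet the always-majority classifier scores zero on both F1 and MCC, which is the usual reason these metrics are preferred when classes are heavily imbalanced.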

Details

Language :
English
ISSN :
0020-0255
Volume :
665
Database :
Academic Search Index
Journal :
Information Sciences
Publication Type :
Periodical
Accession number :
176100197
Full Text :
https://doi.org/10.1016/j.ins.2024.120351