Over-sampling methods for mixed data in imbalanced problems.

Authors :: Alonso, Hugo; Pinto da Costa, Joaquim Fernando
Source :: Communications in Statistics: Simulation & Computation. Dec2024, p1-23. 23p. 4 Illustrations.
Publication Year :: 2024
Abstract: In practice, it is common to find imbalanced classification problems, where one or more classes have far fewer examples than the others. There are several ways to deal with imbalance in order to improve the classification results in the less represented class(es), and one of them consists in applying re-sampling methods. Furthermore, it is no less common for data sets in imbalanced classification problems to be a mix of nominal, ordinal, quantitative discrete and continuous data. However, the true nature of the data tends to be ignored, as when ordinal data are treated as nominal. In this paper, we propose several re-sampling methods for mixed data, which take into account the four scales of measurement usually found in real data. They are based on the popular synthetic minority over-sampling technique, or SMOTE. We consider different measures of distance adequate for mixed data. We also introduce new ways of creating the synthetic examples, using all of the nearest neighbors. We show through a comparative study that it pays off to take into account the true nature of the data and to use the new ways of creating synthetic examples. [ABSTRACT FROM AUTHOR]
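
Illustrative sketch (not taken from the paper): the abstract does not give the authors' distance measures or their rules for building synthetic examples, so the Python below only shows the general idea of a SMOTE-style over-sampler that respects the four scales of measurement. Everything in it is an assumption made for illustration: the scale labels "nom", "ord", "dis" and "con", the Gower-style distance, the function names, and the choice to interpolate towards a single randomly chosen neighbor as in classic SMOTE rather than using all of the nearest neighbors as the paper proposes.

```python
import random

# Hypothetical schema: one scale label per feature column.
# "nom" = nominal, "ord" = ordinal (integer ranks),
# "dis" = quantitative discrete, "con" = quantitative continuous.

def feature_ranges(rows, scales):
    """Observed range of every non-nominal column (None for nominal ones)."""
    ranges = []
    for j, scale in enumerate(scales):
        if scale == "nom":
            ranges.append(None)
        else:
            column = [row[j] for row in rows]
            ranges.append(max(column) - min(column))
    return ranges

def mixed_distance(a, b, scales, ranges):
    """Gower-style distance: mean of per-feature dissimilarities in [0, 1]."""
    total = 0.0
    for x, y, scale, rng_width in zip(a, b, scales, ranges):
        if scale == "nom":
            # Nominal values can only match or differ.
            total += 0.0 if x == y else 1.0
        else:
            # Ordinal ranks and quantitative values: range-scaled difference.
            total += abs(x - y) / rng_width if rng_width else 0.0
    return total / len(scales)

def smote_mixed(minority, scales, k=3, n_new=5, seed=0):
    """Create n_new synthetic minority rows (classic one-neighbor SMOTE step)."""
    rng = random.Random(seed)
    ranges = feature_ranges(minority, scales)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: mixed_distance(base, row, scales, ranges))
        neighbor = rng.choice(others[:k])  # one of the k nearest neighbors
        gap = rng.random()                 # interpolation weight in [0, 1)
        new_row = []
        for x, y, scale in zip(base, neighbor, scales):
            if scale == "nom":
                # No order to interpolate on: copy one of the two categories.
                new_row.append(x if rng.random() < 0.5 else y)
            elif scale == "con":
                new_row.append(x + gap * (y - x))
            else:
                # Ordinal and discrete: interpolate, then round to a valid level.
                new_row.append(int(round(x + gap * (y - x))))
        synthetic.append(tuple(new_row))
    return synthetic

# Toy minority class: (color, severity rank, visit count, dose).
minority = [("red", 1, 2, 0.5), ("blue", 3, 4, 1.5),
            ("red", 2, 3, 1.0), ("green", 1, 5, 2.0)]
scales = ["nom", "ord", "dis", "con"]
new_rows = smote_mixed(minority, scales, k=2, n_new=6, seed=42)
```

In this sketch an ordinal feature keeps its order during both the neighbor search and the interpolation, which is the kind of information lost when ordinal data are treated as nominal. Every generated value stays inside the range or the category set observed in the minority class.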