
Supplementing training with data from a shifted distribution for machine learning classifiers: adding more cases may not always help

Authors :
Nicholas Petrick
Alexej Gossmann
Kenny H. Cha
Berkman Sahiner
Source :
Medical Imaging: Image Perception, Observer Performance, and Technology Assessment
Publication Year :
2020
Publisher :
SPIE, 2020.

Abstract

In this study, we show that when a training data set is supplemented by drawing samples from a distribution that is different from that of the target population, the differences in the distributions of the original and supplemental training populations should be considered to maximize the performance of the classifier in the target population. Depending on these distributions, drawing a large number of cases from the supplemental distribution may result in lower performance compared to limiting the number of added cases. This is relevant for medical images when synthetic data is used for training a machine learning algorithm, which may result in a mixed distribution for the training set. We simulated a two-class classification problem and determined the performance of a linear classifier and a neural network classifier on test cases when trained with cases from only the target distribution, and when cases from a shifted, supplemental distribution are added to a limited number of cases from the target distribution. We show that adding data from a supplemental distribution for machine learning classifier training may improve the performance on the target test distribution. However, given the same number of training cases from a mixed distribution, the performance may not reach the performance of only training on data from the target distribution. In addition, the increase in performance will peak or plateau, depending on the shift in the distribution and the number of cases from the supplemental distribution.
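
The kind of simulation the abstract describes can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' setup: the classes are unit-variance Gaussians in 5 dimensions, the supplemental distribution differs from the target by a mean offset along the first feature, the linear classifier is a Fisher linear discriminant, and performance is measured as AUC on target test cases. The function names (`sample_cases`, `fit_lda`, `auc`, `run_experiment`) and all parameter values are hypothetical.

```python
import numpy as np


def sample_cases(n_per_class, class_mean, shift, rng):
    """Draw n_per_class cases per class from unit-variance Gaussians.

    Class 0 is centered at -class_mean and class 1 at +class_mean.
    `shift` moves both class means along the first feature; this is an
    assumed shift model, chosen only for illustration.
    """
    d = class_mean.shape[0]
    offset = np.zeros(d)
    offset[0] = shift
    x0 = rng.normal(size=(n_per_class, d)) - class_mean + offset
    x1 = rng.normal(size=(n_per_class, d)) + class_mean + offset
    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y


def fit_lda(X, y):
    """Return Fisher linear discriminant weights (pooled covariance)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    centered = np.vstack([X0 - m0, X1 - m1])
    pooled_cov = centered.T @ centered / (len(X) - 2)
    pooled_cov += 1e-6 * np.eye(X.shape[1])  # small ridge for stability
    return np.linalg.solve(pooled_cov, m1 - m0)


def auc(scores, y):
    """Area under the ROC curve via the Mann-Whitney statistic."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def run_experiment(n_target, n_supp, shift, seed=0, d=5, n_test=1000):
    """Train on target plus shifted supplemental cases; test on target."""
    rng = np.random.default_rng(seed)
    class_mean = np.full(d, 0.3)  # assumed class separation
    X_t, y_t = sample_cases(n_target, class_mean, 0.0, rng)
    X_s, y_s = sample_cases(n_supp, class_mean, shift, rng)
    X_test, y_test = sample_cases(n_test, class_mean, 0.0, rng)
    w = fit_lda(np.vstack([X_t, X_s]), np.concatenate([y_t, y_s]))
    return auc(X_test @ w, y_test)


if __name__ == "__main__":
    # Fixed number of target cases, growing number of supplemental cases.
    for n_supp in (0, 20, 200, 2000):
        print(n_supp, run_experiment(n_target=20, n_supp=n_supp, shift=1.0))
```

Sweeping `n_supp` and `shift`, and averaging over many seeds, is one way to look for the peak-or-plateau behavior the abstract reports; a single run with one seed is too noisy to draw conclusions from.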

Details

Database :
OpenAIRE
Journal :
Medical Imaging 2020: Image Perception, Observer Performance, and Technology Assessment
Accession number :
edsair.doi...........b488828b16e8da542dca747823d29efe
Full Text :
https://doi.org/10.1117/12.2550538