How does imbalanced data affect performance of regression CNNs?

Abstract

This research provides an overview of how training Convolutional Neural Networks (CNNs) on imbalanced datasets affects their performance. Datasets can be imbalanced for several reasons. For example, there are naturally fewer samples of rare diseases. Because the network sees fewer such instances during training, its performance on them may be worse, even though correctly identifying exactly those cases may matter most. Furthermore, it is non-trivial to check whether data generated in real time is balanced. The networks in this research are trained on three types of synthetic datasets: balanced datasets, datasets with missing targets, and datasets with normally distributed targets. The task of the network is to estimate the standard deviation of the pixel intensity of the input. The results show that training on balanced datasets works best; however, training on datasets with normally distributed targets does not cause a large loss in performance. Furthermore, in this setting the CNNs were still able to learn the task with decent performance when targets were missing from the training set.
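
To make the three dataset types concrete, the following is a minimal sketch of how such synthetic data could be generated. It is not the generator used in this research: the function names, the target range [0, 1], the held-out interval [0.4, 0.6), the image size of 16 x 16, and the use of Gaussian noise pixels are all assumptions made for illustration.

```python
import random
import statistics


def sample_target(kind, rng, low=0.0, high=1.0, gap=(0.4, 0.6)):
    """Draw one target standard deviation (all ranges are assumed values)."""
    if kind == "balanced":
        # Every target in [low, high] is equally likely.
        return rng.uniform(low, high)
    if kind == "missing":
        # Uniform targets, but reject any that fall in the held-out interval.
        while True:
            t = rng.uniform(low, high)
            if not (gap[0] <= t < gap[1]):
                return t
    if kind == "normal":
        # Targets cluster around the middle of the range; clip to stay inside it.
        t = rng.gauss((low + high) / 2, (high - low) / 6)
        return min(max(t, low), high)
    raise ValueError(f"unknown dataset kind: {kind}")


def make_image(target, rng, size=16):
    """Return a size x size image whose pixel standard deviation equals target."""
    pixels = [rng.gauss(0.0, 1.0) for _ in range(size * size)]
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    # Standardise the noise to unit std, then scale it to the target std.
    scaled = [(p - mean) / std * target for p in pixels]
    return [scaled[r * size:(r + 1) * size] for r in range(size)]


def make_dataset(kind, n, seed=0):
    """Build n (image, target) pairs for one of the three dataset kinds."""
    rng = random.Random(seed)
    targets = [sample_target(kind, rng) for _ in range(n)]
    images = [make_image(t, rng) for t in targets]
    return images, targets
```

Calling `make_dataset("balanced", 1000)`, `make_dataset("missing", 1000)`, and `make_dataset("normal", 1000)` would give three training sets that differ only in how their targets are distributed, which is the variable the abstract describes comparing.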