How does imbalanced data affect performance of regression CNNs?
R.K. Thakoersingh (TU Delft - Electrical Engineering, Mathematics and Computer Science)
T.J. Viering – Mentor (TU Delft - Computer Science & Engineering-Teaching Team)
Y. Kato – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
M. Loog – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
D.M.J. Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
K. Hildebrandt – Coach (TU Delft - Computer Graphics and Visualisation)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
This research provides an overview of how training Convolutional Neural Networks (CNNs) on imbalanced datasets affects their performance. Datasets can be imbalanced for several reasons. For example, there are naturally fewer samples of rare diseases. Because the network sees fewer of those instances during training, it may perform worse on them, even though identifying those cases correctly may be the most crucial. Furthermore, it is non-trivial to check whether data generated in real time is balanced. The networks in this research are trained on three types of synthetic datasets: balanced datasets, datasets with missing targets, and datasets with normally distributed targets. The task of the network is to predict the standard deviation of the pixel intensity of the input. The results show that it is best to train the network on balanced datasets; however, training on datasets with normally distributed targets does not result in a large performance loss. Furthermore, in this case the CNNs were still able to learn the task with decent performance when targets were missing from the training set.
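To make the three dataset types concrete, the following is a minimal sketch of how such synthetic data could be generated with NumPy. It is not the thesis's own generation procedure: the target range of 0 to 1, the missing interval of 0.4 to 0.6, the 16x16 image size, the use of zero-mean Gaussian noise as pixel intensities, and the function names `sample_targets` and `make_images` are all assumptions chosen for illustration.

```python
import numpy as np


def sample_targets(kind, n, rng, low=0.0, high=1.0, gap=(0.4, 0.6)):
    """Draw n target standard deviations in [low, high].

    All ranges here are illustrative assumptions, not values from the thesis.
    """
    if kind == "balanced":
        # Every target value in the range is equally likely.
        return rng.uniform(low, high, size=n)
    if kind == "missing":
        # Uniform targets, but none fall inside the interval `gap`
        # (simple rejection sampling).
        out = np.empty(0)
        while out.size < n:
            cand = rng.uniform(low, high, size=2 * n)
            cand = cand[(cand < gap[0]) | (cand > gap[1])]
            out = np.concatenate([out, cand])
        return out[:n]
    if kind == "normal":
        # Targets concentrate around the middle of the range;
        # values outside [low, high] are clipped to the boundary.
        mid = (low + high) / 2
        t = rng.normal(mid, (high - low) / 6, size=n)
        return np.clip(t, low, high)
    raise ValueError(f"unknown dataset kind: {kind}")


def make_images(targets, rng, size=16):
    """Create one noise image per target whose pixel std equals the target."""
    noise = rng.standard_normal((targets.size, size, size))
    # Standardize each image to zero mean and unit standard deviation,
    # then scale it so its standard deviation matches the target exactly.
    noise -= noise.mean(axis=(1, 2), keepdims=True)
    noise /= noise.std(axis=(1, 2), keepdims=True)
    return noise * targets[:, None, None]


rng = np.random.default_rng(0)
datasets = {}
for kind in ("balanced", "missing", "normal"):
    y = sample_targets(kind, 1000, rng)
    datasets[kind] = (make_images(y, rng), y)
```

Each entry of `datasets` holds an image array and its regression targets, so a regression CNN could be trained on one entry and evaluated on a balanced test set to compare the three training distributions.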