Learning Curves

How do Data Imbalances affect the Learning Curves using the Nearest Mean Model?


Abstract

This research investigates the impact of data imbalance on the learning curve of the nearest mean model. Learning curves represent a model's performance as the training set size increases. Imbalanced datasets are common in real-life scenarios and pose challenges to standard classifier models, degrading their performance. The research question is therefore: "How do data imbalances affect the learning curves using the nearest mean model?". To answer it, an experiment is conducted in which data is sampled from multivariate Gaussian distributions at varying levels of imbalance. The imbalance ratios explored are [0.1, 0.2, 0.3, 0.4, 0.5], where each ratio is the fraction of the dataset belonging to the minority class. The findings indicate that as the data becomes more imbalanced, the learning curves reach their accuracy plateau more slowly. Analysis of the parameters of a logistic function fitted to the curves suggests that imbalance affects both the maximum achievable accuracy and the rightward shift of the curves; however, the effect on the maximum achievable accuracy is not statistically significant, and the overall shape of the curves remains similar. Additionally, false negatives have a significant impact on the learning curves.
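The experimental setup described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual code: it assumes two unit-variance Gaussian classes with shifted means (the means, dimensionality, and training sizes are illustrative assumptions), samples training sets at each imbalance ratio, and evaluates a nearest mean classifier on a balanced test set to trace out a learning curve.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_mean_predict(X_train, y_train, X_test):
    # Compute the per-class mean and assign each test point to the class
    # whose mean is closest in Euclidean distance (the nearest mean model).
    means = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_test[:, None, :] - means[None, :, :], axis=2)
    return dists.argmin(axis=1)

def sample_imbalanced(n, ratio, dim=2):
    # The minority class (label 1) makes up `ratio` of the dataset.
    # Both classes are unit-variance Gaussians; the mean shift of 2.0
    # is an illustrative assumption, not a value from the study.
    n_min = max(1, int(round(n * ratio)))
    n_maj = n - n_min
    X = np.vstack([rng.normal(0.0, 1.0, (n_maj, dim)),
                   rng.normal(2.0, 1.0, (n_min, dim))])
    y = np.concatenate([np.zeros(n_maj, int), np.ones(n_min, int)])
    return X, y

# Balanced test set so accuracy is comparable across imbalance ratios.
X_test, y_test = sample_imbalanced(2000, 0.5)

for ratio in [0.1, 0.2, 0.3, 0.4, 0.5]:
    accs = []
    for n in [20, 50, 100, 200, 500]:  # increasing training sizes
        X_tr, y_tr = sample_imbalanced(n, ratio)
        preds = nearest_mean_predict(X_tr, y_tr, X_test)
        accs.append((preds == y_test).mean())
    print(f"ratio={ratio}: {[round(a, 2) for a in accs]}")
```

Plotting accuracy against training size for each ratio yields the learning curves; a logistic function can then be fitted to each curve to compare plateau accuracy and horizontal shift across imbalance levels, as in the analysis above.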