Augment it Maybe?

Improving Deep Vision Models with Adversarial Scene Text Augmentation

Abstract

Image data augmentation has long been regarded as a reliable and effective way to increase the data available for training. With the rise of Generative AI, generative data augmentation has been shown to realize even larger gains in downstream-task performance. However, these gains often stem from "extra information" seeping into the generated examples via pre-trained model weights, heuristic inclusions, and the like. In this paper, we showcase the impact of text-in-image augmentation on the performance of an underlying downstream task (classification or recognition). The study specifically looks at the difference in performance when training a classifier under three settings (no augmentation, transform-based augmentation, and generative augmentation) and investigates whether and where such augmentation can be successfully employed to realize performance gains without letting any "extra information" seep in. We examine this difference in performance under varying amounts of training data, and for generated samples of varying similarity to the original training data. We also present a new GAN architecture, the conditional Classification Deep Convolutional GAN (CcGAN), as an improved baseline over the conditional Deep Convolutional GAN (cDCGAN) for our experiments; it gave a 4% performance gain over unaugmented data with no "extra information". We find that in certain settings there is a performance advantage to training vision models on real and generated data in text-in-image settings. We also confirm that the number of original training samples available affects the test accuracy achieved with generative augmentation: a sharp fall-off is seen in extremely low- and high-data regimes, while performance is maximized at a "sweet spot" where the robustness and variability added by the generated samples help realize performance gains. We also observed that the 1x and 5x augmentation configurations performed better than the others. Lastly, we find that the similarity of the generated samples to the training data does not vary consistently with model performance in most settings.

Files

ThesisReport_ASharma.pdf
(pdf | 8.98 MB)
License info not available