Learning the scale of image features in Convolutional Neural Networks


Abstract

Each of the millions of filter weights in a Convolutional Neural Network (CNN) has a well-defined, analytical expression for the partial derivative of the loss function with respect to it. Therefore, these weights can be learned from data with gradient descent optimization. While the filter weights have well-defined derivatives, the filter size does not: there is currently no way to optimize filter sizes other than an exhaustive search over multiple CNNs trained with different filter sizes.
In this report, we propose a new filter, called the Structured Receptive Field, whose filter size can be optimized during training. This new filter is a parameterization of a normal filter as a linear combination of all 2D Gaussian derivatives up to a certain order. Instead of learning the weights of the resulting filter directly, we learn the weights of the linear combination and the sigma of the Gaussian derivatives, which together implicitly define the filter weights. The advantage of parameterizing normal filters in this way is that gradient descent optimization of the continuous sigma parameter, which has a well-defined derivative, can be used as a proxy for optimizing the discrete filter size, which does not.
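As a rough illustration of this parameterization, the sketch below (a minimal NumPy illustration, not code from the report; the function names, grid size, and sigma value are our own assumptions) assembles a filter as a weighted sum of sampled Gaussian derivatives:

    import numpy as np

    def gaussian_derivative_basis_2d(sigma, order=4, half_size=5):
        """All separable 2D Gaussian derivatives with total order i + j <= `order`,
        sampled on a fixed (2*half_size + 1)^2 grid. Only sigma controls the
        effective spatial extent of the basis; the sampled grid never changes."""
        x = np.arange(-half_size, half_size + 1, dtype=float)
        g = np.exp(-x**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
        t = x / sigma
        he = [np.ones_like(t), t]                 # probabilists' Hermite polynomials He_0, He_1
        for n in range(1, order):
            he.append(t * he[n] - n * he[n - 1])  # He_{n+1}(t) = t*He_n(t) - n*He_{n-1}(t)
        # d^n/dx^n G(x; sigma) = (-1/sigma)^n He_n(x/sigma) G(x; sigma)
        d = [((-1.0 / sigma) ** n) * he[n] * g for n in range(order + 1)]
        basis = [np.outer(d[j], d[i])             # separable: derivative order i in x, j in y
                 for i in range(order + 1)
                 for j in range(order + 1 - i)]
        return np.stack(basis)                    # shape: (n_basis, size, size)

    def structured_filter(alphas, sigma, order=4, half_size=5):
        """Filter weights as a linear combination of the Gaussian derivative basis."""
        basis = gaussian_derivative_basis_2d(sigma, order, half_size)
        return np.tensordot(np.asarray(alphas), basis, axes=1)  # weighted sum over basis

    # Example: an order-4 basis has 15 functions, so 15 combination weights plus one sigma.
    filt = structured_filter(alphas=np.random.randn(15), sigma=1.5)

In an actual CNN this construction would live inside an automatic differentiation framework, so that the gradient of the loss reaches both the combination weights and sigma; that is what makes the scale learnable.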
The basic idea for this parameterization comes from scale-space theory, in which the scale of image features is studied by parameterizing the features with a continuous scale parameter, thereby explicitly decoupling the spatial structure of an image feature from the scale at which it occurs in the image. Structured receptive fields have several compelling advantages: they can learn larger filters without increasing the number of learnable parameters and without increasing the complexity of the filter's structure.
In this report, we provide both theoretical and empirical evidence that structured receptive fields can approximate any filter learned by a normal CNN; in particular, we show that structured receptive fields of order 4 can approximate the filters in all layers of a pre-trained AlexNet. Furthermore, we demonstrate empirically that structured receptive fields can indeed learn their own filter size during training, and that this extra ability appears to give them an advantage over normal filters in classification tasks: by replacing the normal filters in a DenseNet architecture and keeping everything else the same, a ~1% higher test accuracy was obtained on the highly competitive CIFAR-10 benchmark dataset.
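The approximation claim can be pictured as an ordinary least-squares fit of an existing filter onto the same basis. The sketch below is again only an illustration with assumed names, reusing gaussian_derivative_basis_2d from the sketch above; the report's actual fitting procedure may differ:

    def fit_filter(target, sigma, order=4):
        """Least-squares projection of an existing square, odd-sized filter
        onto the Gaussian derivative basis at a given sigma."""
        half_size = target.shape[0] // 2
        basis = gaussian_derivative_basis_2d(sigma, order, half_size)
        A = basis.reshape(len(basis), -1).T          # (pixels, n_basis) design matrix
        alphas, *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)
        approx = (A @ alphas).reshape(target.shape)  # best approximation in the basis
        residual = np.linalg.norm(target - approx) / np.linalg.norm(target)
        return alphas, approx, residual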
It would not be wise to draw hard conclusions from a single training run on a single dataset, but this result is promising. In future research, we hope to demonstrate that, because of their extra ability to learn their filter size, replacing normal filters with structured receptive fields always leads to equal or better performance.