R. Bruintjes

6 records found

Doctoral thesis (2026) - R. Bruintjes, M.J.T. Reinders, J.C. van Gemert
The field of computer vision research is very large and still growing. Many papers in the field concern some type of inductive bias, proposing new building blocks or alternative training methods for vision models. This type of research has enabled great progress in applications of vision models.

Computer vision concerns itself with the research and development of deep learning models that work on visual data. These vision models are already heavily integrated into society, powering real-world applications such as automated radiology in hospitals, self-driving cars, and autonomous drones. However, it takes a lot of data, in the form of datasets containing thousands or millions of images, to learn reliable vision models. This thesis explores the role that spatial biases (prior knowledge on the position and pose of objects in the image) can play in learning better and more data-efficient vision models.

We find that the practice of integrating prior knowledge on spatial biases (inductive spatial biases) can help to learn biases that are otherwise hard or impossible to learn. Though inductive bias can be difficult and time-consuming to design, and often increases inference cost, integrating inductive bias can result in better performance and greater data efficiency. This work showcases these patterns in spatial biases, specifically position bias and scale bias.

We find that position bias may be learned to some degree by models without the proper inductive bias, but that inductive bias helps to model these biases and improves performance. We show that whether learning position bias is helpful depends on the data. We contribute measures for position bias in vision models in general, as well as in Vision Transformers specifically, to enable the discovery of these findings. We propose an inductive bias on the position embedding of ViTs to better (un)learn position bias.

For scale bias, we find that existing scale-equivariant models for scale bias need to be tuned to the scale distribution of the data. We propose an inductive bias that allows scale-equivariant models to learn the scale bias of the dataset, thereby fitting the data better. We also propose an alternative parameterization of convolutions called MAGNet that can be adapted to known scale distributions present in the data. Models using MAGNets (FlexNets) can be much shallower and do not require pooling.

There are those who advocate against spending much time on inductive biases. The “bitter lesson” of Richard Sutton prescribes that we should simply add more data, not more inductive bias. However, data will run out at some point, perhaps sooner rather than later. Besides, raw performance of vision models, given as much data as possible, should not be our only goal: data-deficient settings are real, plentiful, and important. Data-efficient vision models are the future of our field, and the search for appropriate inductive biases will remain an important endeavor. ...
Journal article (2024) - Mark Basting, Robert Jan Bruintjes, Thaddäus Wiedemer, Matthias Kümmerer, Matthias Bethge, Jan van Gemert
Objects can take up an arbitrary number of pixels in an image: objects come in different sizes, and photographs of these objects may be taken at various distances from the camera. These pixel size variations are problematic for CNNs, causing them to learn separate filters for scaled variants of the same objects, which prevents learning across scales. This is addressed by scale-equivariant approaches that share features across a set of pre-determined fixed internal scales. These works, however, give little information about how to best choose the internal scales when the underlying distribution of sizes, or scale distribution, in the dataset is unknown. In this work we investigate learning the internal scale distribution in scale-equivariant CNNs, allowing them to adapt to unknown data scale distributions. We show that our method can learn the internal scales on various data scale distributions and can adapt the internal scales in current scale-equivariant approaches. ...
Color is a crucial visual cue readily exploited by Convolutional Neural Networks (CNNs) for object recognition. However, CNNs struggle if there is data imbalance between color variations introduced by accidental recording conditions. Color invariance addresses this issue but does so at the cost of removing all color information, which sacrifices discriminative power. In this paper, we propose Color Equivariant Convolutions (CEConvs), a novel deep learning building block that enables shape feature sharing across the color spectrum while retaining important color information. We extend the notion of equivariance from geometric to photometric transformations by incorporating parameter sharing over hue-shifts in a neural network. We demonstrate the benefits of CEConvs in terms of downstream performance to various tasks and improved robustness to color changes, including train-test distribution shifts. Our approach can be seamlessly integrated into existing architectures, such as ResNets, and offers a promising solution for addressing color-based domain shifts in CNNs. ...
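As a rough illustration of what equivariance to hue shifts means, the sketch below restricts itself to 120-degree hue shifts, which for RGB values reduce to a cyclic permutation of the channels. That restriction, the single-pixel setting, and the function names (`hue_shift`, `lifted_response`) are assumptions made for this example and are not taken from the abstract. The same filter is applied under every hue shift, so hue-shifting the input only reorders the outputs instead of changing them.

```python
def hue_shift(rgb, steps=1):
    """Shift hue by steps * 120 degrees by cycling channels: (R, G, B) -> (B, R, G)."""
    values = tuple(rgb)
    for _ in range(steps % 3):
        values = (values[2], values[0], values[1])
    return values


def lifted_response(weights, pixel):
    """Apply one 3-channel filter under all three 120-degree hue shifts.

    Returns one response per hue shift of the filter (hypothetical helper).
    """
    return tuple(
        sum(w * p for w, p in zip(hue_shift(weights, k), pixel))
        for k in range(3)
    )


weights = (0.5, -1.0, 2.0)   # arbitrary example filter
pixel = (0.2, 0.7, 0.1)      # arbitrary example RGB pixel

original = lifted_response(weights, pixel)
shifted = lifted_response(weights, hue_shift(pixel))
# Equivariance: shifting the input hue cyclically shifts the responses.
```

Here `shifted` equals `hue_shift(original)` up to floating-point rounding: the information is preserved and only moves along the hue axis of the output.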
Conference paper (2023) - Robert-Jan Bruintjes, Tomasz Motyka, Jan van Gemert
Equivariance w.r.t. geometric transformations in neural networks improves data efficiency, parameter efficiency and robustness to out-of-domain perspective shifts. When equivariance is not designed into a neural network, the network can still learn equivariant functions from the data. We quantify this learned equivariance, by proposing an improved measure for equivariance. We find evidence for a correlation between learned translation equivariance and validation accuracy on ImageNet. We therefore investigate what can increase the learned equivariance in neural networks, and find that data augmentation, reduced model capacity and inductive bias in the form of convolutions induce higher learned equivariance in neural networks. ...
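To make the idea of quantifying translation equivariance concrete, here is a minimal sketch on 1D signals. It compares "shift, then apply the function" with "apply the function, then shift" using cosine similarity. This is a generic check and not the improved measure the paper proposes; the function names and the choice of cosine similarity are assumptions for illustration.

```python
import math


def shift(signal, k):
    """Circularly shift a 1D list k positions to the right."""
    k %= len(signal)
    return signal[-k:] + signal[:-k]


def circular_conv(signal, kernel):
    """1D circular cross-correlation, translation equivariant by construction."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def position_weighted(signal):
    """Scales each entry by its index, so it depends on absolute position."""
    return [i * v for i, v in enumerate(signal)]


def cosine(a, b):
    """Cosine similarity of two equal-length lists (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def equivariance_score(f, signal, k):
    """Similarity between f(shift(x)) and shift(f(x)); 1.0 means equivariant here."""
    return cosine(f(shift(signal, k)), shift(f(signal), k))


signal = [1.0, 2.0, 3.0, 4.0]
conv_score = equivariance_score(lambda s: circular_conv(s, [1.0, -1.0]), signal, 1)
weighted_score = equivariance_score(position_weighted, signal, 1)
```

The circular convolution scores 1.0 up to rounding, while the position-dependent map scores clearly lower, which is the kind of gap such a measure is meant to expose.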
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end. Training and testing current state-of-the-art deep learning models requires access to large amounts of data and computational power. However, gathering such data is challenging and computational resources might be limited. This work explores and measures how current deep temporal action localization models perform in settings constrained by the amount of data or computational power. We measure data efficiency by training each model on a subset of the training set. We find that TemporalMaxer outperforms other models in data-limited settings. Furthermore, we recommend TriDet when training time is limited. To test the efficiency of the models during inference, we pass videos of different lengths through each model. We find that TemporalMaxer requires the least computational resources, likely due to its simple architecture. ...
Conference paper (2022) - David W. Romero, R. Bruintjes, Erik J. Bekkers, Jakub M. Tomczak, Mark Hoogendoorn, J.C. van Gemert
When designing Convolutional Neural Networks (CNNs), one must select the size of the convolutional kernels before training. Recent works show CNNs benefit from different kernel sizes at different layers, but exploring all possible combinations is unfeasible in practice. A more efficient approach is to learn the kernel size during training. However, existing works that learn the kernel size have a limited bandwidth. These approaches scale kernels by dilation, and thus the detail they can describe is limited. In this work, we propose FlexConv, a novel convolutional operation with which high bandwidth convolutional kernels of learnable kernel size can be learned at a fixed parameter cost. FlexNets model long-term dependencies without the use of pooling, achieve state-of-the-art performance on several sequential datasets, outperform recent works with learned kernel sizes, and are competitive with much deeper ResNets on image benchmark datasets. Additionally, FlexNets can be deployed at higher resolutions than those seen during training. To avoid aliasing, we propose a novel kernel parameterization with which the frequency of the kernels can be analytically controlled. Our novel kernel parameterization shows higher descriptive power and faster convergence speed than existing parameterizations. This leads to important improvements in classification accuracy. ...
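One way to obtain a learnable kernel size at a fixed parameter cost is to multiply kernel values sampled on a fixed grid by a Gaussian mask whose width is a trainable parameter, then crop the positions where the mask is negligible. The sketch below assumes that form; the abstract does not spell out the mechanism, and the names `gaussian_mask`, `masked_kernel`, `sigma`, and `threshold` are hypothetical. It only shows the forward computation, with no training.

```python
import math


def gaussian_mask(positions, sigma):
    """Gaussian window centered at 0; sigma plays the role of a learnable width."""
    return [math.exp(-0.5 * (x / sigma) ** 2) for x in positions]


def masked_kernel(values, sigma, threshold=0.1):
    """Weight kernel values by the mask and crop where the mask is below threshold.

    Assumes an odd number of values (at least 3) sampled on a grid over [-1, 1].
    """
    half = len(values) // 2
    positions = [(i - half) / half for i in range(len(values))]
    mask = gaussian_mask(positions, sigma)
    weighted = [v * m for v, m in zip(values, mask)]
    # The center always has mask value 1.0, so at least one position is kept.
    keep = [i for i, m in enumerate(mask) if m >= threshold]
    return weighted[keep[0]:keep[-1] + 1]


values = [1.0] * 9                         # placeholder kernel values
small = masked_kernel(values, sigma=0.1)   # narrow mask, short effective kernel
large = masked_kernel(values, sigma=1.0)   # wide mask, full-length kernel
```

The number of stored values is the same in both cases; only `sigma` changes the effective kernel size, which is what makes the size adjustable by gradient descent in a setup of this kind.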