Visual counting is an important task in computer vision with broad applications in areas such as crowd monitoring, agriculture, and environmental analysis. While deep learning has significantly advanced this field by enabling models to learn robust feature representations, deep learning approaches are sensitive to data imbalances, which arise in the distribution of object counts across counting datasets due to the cost of annotation. Most state-of-the-art counting models, categorized into clustering-, detection-, regression-, and density estimation-based methods, are built upon Convolutional Neural Networks (CNNs) and Transformers, both of which are known to be susceptible to imbalances in the training data. This study introduces a hybrid model that incorporates a programmatically guaranteed counting mechanism using the RASP language and the Tracr compiler, which enable the construction of Transformer-based models that reliably execute predefined tasks such as counting. By combining this exact counting mechanism with a trainable embedding module, we present a model that learns to count various tokens even under significant data imbalance. We validate our approach on a synthetic, imbalanced dataset and compare its performance, training time, and data efficiency against standard CNN- and Transformer-based models. Results suggest that our method achieves strong generalization across the full spectrum of object counts while requiring less training data, highlighting the potential of this architecture to be further investigated and adapted for robust and efficient visual counting.
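The counting mechanism referenced above rests on two RASP primitives, select and selector-width, which Tracr compiles into attention-layer weights. As an illustration only (this is a minimal pure-Python sketch of the idea, not the paper's code or the Tracr API), counting a token amounts to building a boolean attention pattern that matches the target and then summing, per query position, how many key positions were selected:

```python
def select(keys, queries, predicate):
    """Build a boolean attention-like matrix: one row per query, one column per key."""
    return [[predicate(k, q) for k in keys] for q in queries]

def selector_width(selector):
    """For each query position, count how many key positions the selector attends to."""
    return [sum(row) for row in selector]

def count_token(tokens, target):
    """Count occurrences of `target` in `tokens` via select + selector_width,
    mirroring the RASP-style counting program (illustrative sketch)."""
    if not tokens:
        return 0
    sel = select(tokens, tokens, lambda k, q: k == target)
    # Every query position attends to all positions holding `target`,
    # so any row of the selector-width vector carries the count.
    return selector_width(sel)[0]
```

Because this computation is fixed by construction rather than learned, its correctness does not depend on how often each count value appears in the training data; only the embedding module that maps raw inputs to tokens is trained.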