J.C. van Gemert | TU Delft Repository

Motion representations for privacy-aware cross-domain action recognition

Journal article (2026) - P. Benschop, J.C. van Gemert, J.P. Mense, J.H.G. Dauwels

Video captured for action recognition often contains sensitive appearance cues such as faces, skin color, and clothing. Models trained on such data may exploit these cues rather than the underlying motion, raising privacy concerns in real-world deployment. In this work, we study action recognition under a motion-focused constraint: the model receives only motion representations that capture pixel displacement over time, while reducing appearance cues that expose identity or scene context. We focus on motion-history images and optical flow as learning-free representations that reduce identifiable appearance information while retaining action recognition accuracy. Our motion I3D model achieves approximately 31% and 52% zero-shot top-1 accuracy on HMDB-51 and UCF-101, respectively, outperforming non-CLIP direct-transfer baselines trained on Kinetics-400 despite operating without any appearance input. In 16-shot adaptation, the same model reaches 52% and 83% top-1 accuracy. In the domain adaptation setting on TP-HMDB↔TP-UCF, our motion-focused models achieve higher action recognition accuracy than prior privacy-preserving methods. Sensitive attribute predictability is reduced relative to RGB by a comparable margin, without requiring a learned privacy filter. On PA-HMDB51, optical flow is the strongest motion representation for privacy preservation, approaching chance level for skin-color prediction and remaining below RGB on most privacy attributes, indicating that motion representations retain useful action information while exposing less personal information. ...

HAVANA

Hierarchical Stochastic Neighbor Embedding for Accelerated Video ANnotAtions

Conference paper (2025) - Alexandru Bobe, Jan C. van Gemert

Video annotation is a critical and time-consuming task in computer vision research and applications. This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process. Our approach uses Hierarchical Stochastic Neighbor Embedding (HSNE) to create a multi-scale representation of video features, allowing annotators to efficiently explore and label large video datasets. We demonstrate significant improvements in annotation effort compared to traditional linear methods, achieving more than a 10x reduction in clicks required for annotating over 12 h of video. Our experiments on multiple datasets show the effectiveness and robustness of our pipeline across various scenarios. Moreover, we investigate the optimal configuration of HSNE parameters for different datasets. Our work provides a promising direction for scaling up video annotation efforts in the era of video understanding. ...

CleanUMamba

A Compact Mamba Network for Speech Denoising using Channel Pruning

Conference paper (2025) - Sjoerd Groot, Qinyu Chen, Jan C. Van Gemert, Chang Gao

This paper presents CleanUMamba, a time-domain neural network architecture designed for real-time causal audio denoising directly applied to raw waveforms. CleanUMamba leverages a U-Net encoder-decoder structure, incorporating the Mamba state-space model in the bottleneck layer. By replacing conventional self-attention and LSTM mechanisms with Mamba, our architecture offers superior denoising performance while maintaining a constant memory footprint, enabling streaming operation. To enhance efficiency, we applied structured channel pruning, achieving an 8X reduction in model size without compromising audio quality. Our model demonstrates strong results in the Interspeech 2020 Deep Noise Suppression challenge. Specifically, CleanUMamba achieves a PESQ score of 2.42 and STOI of 95.1% with only 442K parameters and 468M MACs, matching or outperforming larger models in real-time performance. Code will be available at: https://github.com/lab-emi/CleanUMamba ...

MSD

A Benchmark Dataset for Floor Plan Generation of Building Complexes

Conference paper (2025) - Casper van Engelenburg, Fatemeh Mostafavi, Emanuel Kuhn, Yuntae Jeon, Michael Franzen, Matthias Standfest, Jan van Gemert, Seyran Khademi

Diverse and realistic floor plan data are essential for the development of useful computer-aided methods in architectural design. Today’s large-scale floor plan datasets predominantly feature simple floor plan layouts, typically representing single-apartment dwellings only. To compensate for the mismatch between current datasets and the real world, we develop Modified Swiss Dwellings (MSD) – the first large-scale floor plan dataset that contains a significant share of layouts of multi-apartment dwellings. MSD features over 5.3K floor plans of medium- to large-scale building complexes, covering over 18.9K distinct apartments. We validate that existing approaches for floor plan generation, while effective in simpler scenarios, cannot yet seamlessly address the challenges posed by MSD. Our benchmark calls for new research in floor plan machine understanding. Code and data are open. ...

Scale Learning in Scale-Equivariant Convolutional Networks

Journal article (2024) - Mark Basting, Robert Jan Bruintjes, Thaddäus Wiedemer, Matthias Kümmerer, Matthias Bethge, Jan van Gemert

Objects can take up an arbitrary number of pixels in an image: Objects come in different sizes, and, photographs of these objects may be taken at various distances to the camera. These pixel size variations are problematic for CNNs, causing them to learn separate filters for scaled variants of the same objects which prevents learning across scales. This is addressed by scale-equivariant approaches that share features across a set of pre-determined fixed internal scales. These works, however, give little information about how to best choose the internal scales when the underlying distribution of sizes, or scale distribution, in the dataset, is unknown. In this work we investigate learning the internal scales distribution in scale-equivariant CNNs, allowing them to adapt to unknown data scale distributions. We show that our method can learn the internal scales on various data scale distributions and can adapt the internal scales in current scale-equivariant approaches. ...

Contrast-Agnostic Groupwise Registration by Robust PCA for Quantitative Cardiac MRI

Conference paper (2024) - Xinqi Li, Yi Zhang, Yidong Zhao, Jan van Gemert, Qian Tao

Quantitative cardiac magnetic resonance imaging (MRI) is an increasingly important diagnostic tool for cardiovascular diseases. Yet, co-registration of all baseline images within the quantitative MRI sequence is essential for the accuracy and precision of quantitative maps. However, co-registering all baseline images from a quantitative cardiac MRI sequence remains a nontrivial task because of the simultaneous changes in intensity and contrast, in combination with cardiac and respiratory motion. To address the challenge, we propose a novel motion correction framework based on robust principle component analysis (rPCA) that decomposes quantitative cardiac MRI into low-rank and sparse components, and we integrate the groupwise CNN-based registration backbone within the rPCA framework. The low-rank component of rPCA corresponds to the quantitative mapping (i.e. limited degree of freedom in variation), while the sparse component corresponds to the residual motion, making it easier to formulate and solve the groupwise registration problem. We evaluated our proposed method on cardiac T1 mapping by the modified Look-Locker inversion recovery (MOLLI) sequence, both before and after the Gadolinium contrast agent administration. Our experiments showed that our method effectively improved registration performance over baseline methods without introducing rPCA, and reduced quantitative mapping error in both in-domain (pre-contrast MOLLI) and out-of-domain (post-contrast MOLLI) inference. The proposed rPCA framework is generic and can be integrated with other registration backbones. ...

Learn & drop

Fast learning of cnns based on layer dropping

Journal article (2024) - Giorgio Cruciata, Luca Cruciata, Liliana Lo Presti, Jan van Gemert, Marco La Cascia

This paper proposes a new method to improve the training efficiency of deep convolutional neural networks. During training, the method evaluates scores to measure how much each layer’s parameters change and whether the layer will continue learning or not. Based on these scores, the network is scaled down such that the number of parameters to be learned is reduced, yielding a speed-up in training. Unlike state-of-the-art methods that try to compress the network to be used in the inference phase or to limit the number of operations performed in the back-propagation phase, the proposed method is novel in that it focuses on reducing the number of operations performed by the network in the forward propagation during training. The proposed training strategy has been validated on two widely used architecture families: VGG and ResNet. Experiments on MNIST, CIFAR-10 and Imagenette show that, with the proposed method, the training time of the models is more than halved without significantly impacting accuracy. The FLOPs reduction in the forward propagation during training ranges from 17.83% for VGG-11 to 83.74% for ResNet-152. As for the accuracy, the impact depends on the depth of the model and the decrease is between 0.26% and 2.38% for VGGs and between 0.4 and 3.2% for ResNets. These results demonstrate the effectiveness of the proposed technique in speeding up learning of CNNs. The technique will be especially useful in applications where fine-tuning or online training of convolutional models is required, for instance because data arrive sequentially. ...

Using and Abusing Equivariance

Conference paper (2023) - T.F. Edixhoven, A. Lengyel, J.C. van Gemert

In this paper we show how Group Equivariant Convolutional Neural Networks use subsampling to learn to break equivariance to the rotation and reflection symmetries. We focus on the 2D rotations and reflections and investigate the impact of the broken equivariance on network performance. We show that a change in the input dimension of a network as small as a single pixel can be enough for commonly used architectures to become approximately equivariant, rather than exactly. We investigate the impact of networks not being exactly equivariant and find that approximately equivariant networks generalise significantly worse to unseen symmetries compared to their exactly equivariant counterparts. However, when the symmetries in the training data are not identical to the symmetries of the network, we find that approximately equivariant networks can relax their equivariance constraints, matching or outperforming exactly equivariant networks on common benchmarks. ...

Color Equivariant Convolutional Networks

Conference paper (2023) - A. Lengyel, O. Strafforello, R. Bruintjes, A.S. Gielisse, J.C. van Gemert

Color is a crucial visual cue readily exploited by Convolutional Neural Networks (CNNs) for object recognition. However, CNNs struggle if there is data imbalance between color variations introduced by accidental recording conditions. Color invariance addresses this issue but does so at the cost of removing all color information, which sacrifices discriminative power. In this paper, we propose Color Equivariant Convolutions (CEConvs), a novel deep learning building block that enables shape feature sharing across the color spectrum while retaining important color information. We extend the notion of equivariance from geometric to photometric transformations by incorporating parameter sharing over hue-shifts in a neural network. We demonstrate the benefits of CEConvs in terms of downstream performance to various tasks and improved robustness to color changes, including train-test distribution shifts. Our approach can be seamlessly integrated into existing architectures, such as ResNets, and offers a promising solution for addressing color-based domain shifts in CNNs. ...

Objects do not disappear

Video object detection by single-frame object location anticipation

Conference paper (2023) - Xin Liu, Jan C. van Gemert, Fatemeh Karimi Nejadasl, Olaf Booij, Silvia L. Pintea

Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by only doing the expensive feature computations on a small subset of all frames. Because neighboring video frames are often redundant, we only compute features for a single static keyframe and predict object locations in subsequent frames. 3) Reduced annotation cost, where we only annotate the keyframe and use smooth pseudo-motion between keyframes. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state-of-the-art on four datasets: ImageNet VID, EPIC KITCHENS-55, YouTube-BoundingBoxes and Waymo Open dataset. Our source code is available at https://github.com/L-KID/Video-object-detection-by-location-anticipation. ...

Non-Destructive Infield Quality Estimation of Strawberries using Deep Architectures

Conference paper (2023) - Cees Jol, Junhan Wen, Jan van Gemert

Strawberries are profitable fruits, yet they have a short shelf life. Therefore, it is crucial to anticipate their quality and harvest them at the best time, which is vital not only for finding the appropriate market but also for minimizing food and economic waste. To this end, non-destructive strawberry quality measurements are useful. Much research is conducted on post-harvest strawberries: the fruits were only analyzed after harvesting and thus, these methods cannot be used to find a good time to harvest. Our research targets pre-harvest analysis for supporting the timing decisions of harvests. As such, we used an infield image dataset that was collected during the cultivation of strawberries. The images are labeled by quality assessments and measurements from post-harvest destructive tests. We evaluated deep learning for quality estimation and trained our algorithms to predict the ripeness, firmness, and sweetness of strawberries. Additionally, we applied depth estimation algorithms and shape inpainting models to estimate the size of strawberries using images. Our results demonstrate the feasibility of infield quality attribute prediction. ...

Assessing facial weakness in myasthenia gravis with facial recognition software and deep learning

Journal article (2023) - Annabel M. Ruiter, Ziqi Wang, Zhao Yin, Willemijn C. Naber, Jerrel Simons, Jurre T. Blom, Jan C. van Gemert, Jan J.G.M. Verschuuren, Martijn R. Tannemaat

Objective: Myasthenia gravis (MG) is an autoimmune disease leading to fatigable muscle weakness. Extra-ocular and bulbar muscles are most commonly affected. We aimed to investigate whether facial weakness can be quantified automatically and used for diagnosis and disease monitoring. Methods: In this cross-sectional study, we analyzed video recordings of 70 MG patients and 69 healthy controls (HC) with two different methods. Facial weakness was first quantified with facial expression recognition software. Subsequently, a deep learning (DL) computer model was trained for the classification of diagnosis and disease severity using multiple cross-validations on videos of 50 patients and 50 controls. Results were validated using unseen videos of 20 MG patients and 19 HC. Results: Expression of anger (p = 0.026), fear (p = 0.003), and happiness (p < 0.001) was significantly decreased in MG compared to HC. Specific patterns of decreased facial movement were detectable in each emotion. Results of the DL model for diagnosis were as follows: area under the curve (AUC) of the receiver operator curve 0.75 (95% CI 0.65–0.85), sensitivity 0.76, specificity 0.76, and accuracy 76%. For disease severity: AUC 0.75 (95% CI 0.60–0.90), sensitivity 0.93, specificity 0.63, and accuracy 80%. Results of validation, diagnosis: AUC 0.82 (95% CI: 0.67–0.97), sensitivity 1.0, specificity 0.74, and accuracy 87%. For disease severity: AUC 0.88 (95% CI: 0.67–1.0), sensitivity 1.0, specificity 0.86, and accuracy 94%. Interpretation: Patterns of facial weakness can be detected with facial recognition software. Second, this study delivers a ‘proof of concept’ for a DL model that can distinguish MG from HC and classifies disease severity. ...

Differentiable Transportation Pruning

Conference paper (2023) - Yunqiang Li, Jan C. van Gemert, Torsten Hoefler, Bert Moons, Evangelos Eleftheriou, Bram-Ernst Verhoef

Deep learning algorithms are increasingly employed at the edge. However, edge devices are resource constrained and thus require efficient deployment of deep neural networks. Pruning methods are a key tool for edge deployment as they can improve storage, compute, memory bandwidth, and energy usage. In this paper we propose a novel accurate pruning technique that allows precise control over the output network size. Our method uses an efficient optimal transportation scheme which we make end-to-end differentiable and which automatically tunes the exploration-exploitation behavior of the algorithm to find accurate sparse sub-networks. We show that our method achieves state-of-the-art performance compared to previous pruning methods on 3 different datasets, using 5 different models, across a wide range of pruning ratios, and with two types of sparsity budgets and pruning granularities. ...

SSIG

A Visually-Guided Graph Edit Distance for Floor Plan Similarity

Conference paper (2023) - Casper van Engelenburg, Seyran Khademi, Jan van Gemert

We propose a simple yet effective metric that measures structural similarity between visual instances of architectural floor plans, without the need for learning. Qualitatively, our experiments show that the retrieval results are similar to deeply learned methods. Effectively comparing instances of floor plan data is paramount to the success of machine understanding of floor plan data, including the assessment of floor plan generative models and floor plan recommendation systems. Comparing visual floor plan images goes beyond a sole pixel-wise visual examination and is crucially about similarities and differences in the shapes and relations between subdivisions that compose the layout. Currently, deep metric learning approaches are used to learn a pair-wise vector representation space that closely mimics the structural similarity, in which the models are trained on similarity labels that are obtained by Intersection-over-Union (IoU). To compensate for the lack of structural awareness in IoU, graph-based approaches such as Graph Matching Networks (GMNs) are used, which require pairwise inference for comparing data instances, making GMNs less practical for retrieval applications. In this paper, an effective evaluation metric for judging the structural similarity of floor plans, coined SSIG (Structural Similarity by IoU and GED), is proposed based on both image and graph distances. In addition, an efficient algorithm is developed that uses SSIG to rank a large-scale floor plan database. Code will be openly available. ...

What Affects Learned Equivariance in Deep Image Recognition Models?

Conference paper (2023) - Robert-Jan Bruintjes, Tomasz Motyka, Jan van Gemert

Equivariance w.r.t. geometric transformations in neural networks improves data efficiency, parameter efficiency and robustness to out-of-domain perspective shifts. When equivariance is not designed into a neural network, the network can still learn equivariant functions from the data. We quantify this learned equivariance, by proposing an improved measure for equivariance. We find evidence for a correlation between learned translation equivariance and validation accuracy on ImageNet. We therefore investigate what can increase the learned equivariance in neural networks, and find that data augmentation, reduced model capacity and inductive bias in the form of convolutions induce higher learned equivariance in neural networks. ...

Benchmarking Data Efficiency and Computational Efficiency of Temporal Action Localization Models

Conference paper (2023) - J. Warchocki, T. Oprescu, Y. Wang, A. Dămăcuș, P.M. Misterka, R. Bruintjes, A. Lengyel, O. Strafforello, J.C. van Gemert

In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end. Training and testing current state-of- the-art deep learning models requires access to large amounts of data and computational power. However, gathering such data is challenging and computational resources might be limited. This work explores and measures how current deep temporal action localization models perform in settings constrained by the amount of data or computational power. We measure data efficiency by training each model on a subset of the training set. We find that TemporalMaxer outperforms other models in data-limited settings. Furthermore, we recommend TriDet when training time is limited. To test the efficiency of the models during inference, we pass videos of different lengths through each model. We find that TemporalMaxer requires the least computational resources, likely due to its simple architecture. ...

Are current long-term video understanding datasets long-term?

Conference paper (2023) - Ombretta Strafforello, Klamer Schutte, Jan van Gemert

Many real-world applications, from sport analysis to surveillance, benefit from automatic long-term action recognition. In the current deep learning paradigm for automatic action recognition, it is imperative that models are trained and tested on datasets and tasks that evaluate if such models actually learn and reason over long-term information. In this work, we propose a method to evaluate how suitable a video dataset is to evaluate models for long-term action recognition. To this end, we define a long-term action as excluding all the videos that can be correctly recognized using solely short-term information. We test this definition on existing long-term classification tasks on three popular real-world datasets, namely Breakfast, CrossTask and LVU, to determine if these datasets are truly evaluating long-term recognition. Our study reveals that these datasets can be effectively solved using shortcuts based on short-term information. Following this finding, we encourage long-term action recognition researchers to make use of datasets that need long-term information to be solved. ...

A step towards understanding why classification helps regression

Conference paper (2023) - Silvia L. Pintea, Yancong Lin, Jouke Dijkstra, Jan C. van Gemert

A number of computer vision deep regression approaches report improved results when adding a classification loss to the regression loss. Here, we explore why this is useful in practice and when it is beneficial. To do so, we start from precisely controlled dataset variations and data samplings and find that the effect of adding a classification loss is the most pronounced for regression with imbalanced data. We explain these empirical findings by formalizing the relation between the balanced and imbalanced regression losses. Finally, we show that our findings hold on two real imbalanced image datasets for depth estimation (NYUD2-DIR), and age estimation (IMDB-WIKI-DIR), and on the problem of imbalanced video progress prediction (Breakfast). Our main takeaway is: for a regression task, if the data sampling is imbalanced, then add a classification loss. ...

Understanding weight-magnitude hyperparameters in training binary networks

Conference paper (2023) - Joris Quist, Yunqiang Li, Jan van Gemert

Binary Neural Networks (BNNs) are compact and efficient by using binary weights instead of real-valued weights. Current BNNs use latent real-valued weights during training, where hyper-parameters are inherited from real-valued networks. The interpretation of several of these hyperparameters is based on the magnitude of the real-valued weights. For BNNs, however, the magnitude of binary weights is not meaningful, and thus it is unclear what these hyperparameters actually do. One example is weight-decay, which aims to keep the magnitude of real-valued weights small. Other examples are latent weight initialization, the learning rate, and learning rate decay, which influence the magnitude of the real-valued weights. The magnitude is interpretable for real-valued weights, but loses its meaning for binary weights. In this paper we offer a new interpretation of these magnitude-based hyperparameters based on higher-order gradient filtering during network optimization. Our analysis makes it possible to understand how magnitude-based hyperparameters influence the training of binary networks which allows for new optimization filters specifically designed for binary neural networks that are independent of their real-valued interpretation. Moreover, our improved understanding reduces the number of hyperparameters, which in turn eases the hyperparameter tuning effort which may lead to better hyperparameter values for improved accuracy. Code is available at https: //github.com/jorisquist/Understanding-WM-HP-in-BNNs ...

Video BagNet

Short temporal receptive fields increase robustness in long-term action recognition

Conference paper (2023) - Ombretta Strafforello, Xin Liu, Klamer Schutte, Jan van Gemert

Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which causes a performance decrease when testing videos have a different sub-action order. In this work, we investigate whether we can improve the model robustness to the sub-action order by shrinking the temporal receptive field of action recognition models. For this, we design Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive field size limited to 1, 9, 17 or 33 frames. We analyze Video Bag-Net on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that short receptive fields are robust to sub-action order changes, while larger temporal receptive fields are sensitive to the sub-action order. ...