J.C. van Gemert

80 records found

Journal article (2026) - P. Benschop, J.C. van Gemert, J.P. Mense, J.H.G. Dauwels
Video captured for action recognition often contains sensitive appearance cues such as faces, skin color, and clothing. Models trained on such data may exploit these cues rather than the underlying motion, raising privacy concerns in real-world deployment. In this work, we study action recognition under a motion-focused constraint: the model receives only motion representations that capture pixel displacement over time, while reducing appearance cues that expose identity or scene context. We focus on motion-history images and optical flow as learning-free representations that reduce identifiable appearance information while retaining action recognition accuracy. Our motion I3D model achieves approximately 31% and 52% zero-shot top-1 accuracy on HMDB-51 and UCF-101, respectively, outperforming non-CLIP direct-transfer baselines trained on Kinetics-400 despite operating without any appearance input. In 16-shot adaptation, the same model reaches 52% and 83% top-1 accuracy. In the domain adaptation setting on TP-HMDB↔TP-UCF, our motion-focused models achieve higher action recognition accuracy than prior privacy-preserving methods. Sensitive attribute predictability is reduced relative to RGB by a comparable margin, without requiring a learned privacy filter. On PA-HMDB51, optical flow is the strongest motion representation for privacy preservation, approaching chance level for skin-color prediction and remaining below RGB on most privacy attributes, indicating that motion representations retain useful action information while exposing less personal information. ...
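As a rough illustration of the motion-history image representation named in the abstract above, the sketch below builds one from grayscale frames by frame differencing. The function name, the `tau` decay length, and the `diff_threshold` value are assumptions made for this example; the abstract does not state the exact formulation the authors use.

```python
import numpy as np


def motion_history_image(frames, tau=10, diff_threshold=0.1):
    """Build a motion-history image from grayscale frames of shape (T, H, W).

    Pixel values are assumed to lie in [0, 1]. The result is in [0, 1],
    where 1 marks the most recent motion and 0 marks no recent motion.
    """
    frames = np.asarray(frames, dtype=np.float32)
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        # A pixel counts as moving when its intensity changes enough.
        moving = np.abs(curr - prev) > diff_threshold
        # Moving pixels are reset to tau; all other pixels decay by one step.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi / tau
```

The output keeps where and how recently pixels changed, but not what the original pixels looked like, which is the sense in which such a representation reduces appearance cues.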

A Benchmark Dataset for Floor Plan Generation of Building Complexes

Conference paper (2025) - Casper van Engelenburg, Fatemeh Mostafavi, Emanuel Kuhn, Yuntae Jeon, Michael Franzen, Matthias Standfest, Jan van Gemert, Seyran Khademi
Diverse and realistic floor plan data are essential for the development of useful computer-aided methods in architectural design. Today’s large-scale floor plan datasets predominantly feature simple floor plan layouts, typically representing single-apartment dwellings only. To compensate for the mismatch between current datasets and the real world, we develop Modified Swiss Dwellings (MSD) – the first large-scale floor plan dataset that contains a significant share of layouts of multi-apartment dwellings. MSD features over 5.3K floor plans of medium- to large-scale building complexes, covering over 18.9K distinct apartments. We validate that existing approaches for floor plan generation, while effective in simpler scenarios, cannot yet seamlessly address the challenges posed by MSD. Our benchmark calls for new research in floor plan machine understanding. Code and data are open. ...

Hierarchical Stochastic Neighbor Embedding for Accelerated Video ANnotAtions

Conference paper (2025) - Alexandru Bobe, Jan C. van Gemert
Video annotation is a critical and time-consuming task in computer vision research and applications. This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process. Our approach uses Hierarchical Stochastic Neighbor Embedding (HSNE) to create a multi-scale representation of video features, allowing annotators to efficiently explore and label large video datasets. We demonstrate significant improvements in annotation effort compared to traditional linear methods, achieving more than a 10x reduction in clicks required for annotating over 12 h of video. Our experiments on multiple datasets show the effectiveness and robustness of our pipeline across various scenarios. Moreover, we investigate the optimal configuration of HSNE parameters for different datasets. Our work provides a promising direction for scaling up video annotation efforts in the era of video understanding. ...

A Compact Mamba Network for Speech Denoising using Channel Pruning

Conference paper (2025) - Sjoerd Groot, Qinyu Chen, Jan C. Van Gemert, Chang Gao
This paper presents CleanUMamba, a time-domain neural network architecture designed for real-time causal audio denoising directly applied to raw waveforms. CleanUMamba leverages a U-Net encoder-decoder structure, incorporating the Mamba state-space model in the bottleneck layer. By replacing conventional self-attention and LSTM mechanisms with Mamba, our architecture offers superior denoising performance while maintaining a constant memory footprint, enabling streaming operation. To enhance efficiency, we applied structured channel pruning, achieving an 8X reduction in model size without compromising audio quality. Our model demonstrates strong results in the Interspeech 2020 Deep Noise Suppression challenge. Specifically, CleanUMamba achieves a PESQ score of 2.42 and STOI of 95.1% with only 442K parameters and 468M MACs, matching or outperforming larger models in real-time performance. Code will be available at: https://github.com/lab-emi/CleanUMamba ...
Conference paper (2024) - Xinqi Li, Yi Zhang, Yidong Zhao, Jan van Gemert, Qian Tao
Quantitative cardiac magnetic resonance imaging (MRI) is an increasingly important diagnostic tool for cardiovascular diseases. Yet, co-registration of all baseline images within the quantitative MRI sequence is essential for the accuracy and precision of quantitative maps. However, co-registering all baseline images from a quantitative cardiac MRI sequence remains a nontrivial task because of the simultaneous changes in intensity and contrast, in combination with cardiac and respiratory motion. To address the challenge, we propose a novel motion correction framework based on robust principal component analysis (rPCA) that decomposes quantitative cardiac MRI into low-rank and sparse components, and we integrate the groupwise CNN-based registration backbone within the rPCA framework. The low-rank component of rPCA corresponds to the quantitative mapping (i.e. limited degree of freedom in variation), while the sparse component corresponds to the residual motion, making it easier to formulate and solve the groupwise registration problem. We evaluated our proposed method on cardiac T1 mapping by the modified Look-Locker inversion recovery (MOLLI) sequence, both before and after the Gadolinium contrast agent administration. Our experiments showed that our method effectively improved registration performance over baseline methods without introducing rPCA, and reduced quantitative mapping error in both in-domain (pre-contrast MOLLI) and out-of-domain (post-contrast MOLLI) inference. The proposed rPCA framework is generic and can be integrated with other registration backbones. ...

Fast learning of CNNs based on layer dropping

Journal article (2024) - Giorgio Cruciata, Luca Cruciata, Liliana Lo Presti, Jan van Gemert, Marco La Cascia
This paper proposes a new method to improve the training efficiency of deep convolutional neural networks. During training, the method evaluates scores to measure how much each layer’s parameters change and whether the layer will continue learning or not. Based on these scores, the network is scaled down such that the number of parameters to be learned is reduced, yielding a speed-up in training. Unlike state-of-the-art methods that try to compress the network to be used in the inference phase or to limit the number of operations performed in the back-propagation phase, the proposed method is novel in that it focuses on reducing the number of operations performed by the network in the forward propagation during training. The proposed training strategy has been validated on two widely used architecture families: VGG and ResNet. Experiments on MNIST, CIFAR-10 and Imagenette show that, with the proposed method, the training time of the models is more than halved without significantly impacting accuracy. The FLOPs reduction in the forward propagation during training ranges from 17.83% for VGG-11 to 83.74% for ResNet-152. As for the accuracy, the impact depends on the depth of the model and the decrease is between 0.26% and 2.38% for VGGs and between 0.4 and 3.2% for ResNets. These results demonstrate the effectiveness of the proposed technique in speeding up learning of CNNs. The technique will be especially useful in applications where fine-tuning or online training of convolutional models is required, for instance because data arrive sequentially. ...
Journal article (2024) - Mark Basting, Robert Jan Bruintjes, Thaddäus Wiedemer, Matthias Kümmerer, Matthias Bethge, Jan van Gemert
Objects can take up an arbitrary number of pixels in an image: objects come in different sizes, and photographs of these objects may be taken at various distances from the camera. These pixel size variations are problematic for CNNs, causing them to learn separate filters for scaled variants of the same objects, which prevents learning across scales. This is addressed by scale-equivariant approaches that share features across a set of pre-determined fixed internal scales. These works, however, give little information about how to best choose the internal scales when the underlying distribution of sizes, or scale distribution, in the dataset is unknown. In this work, we investigate learning the internal scales distribution in scale-equivariant CNNs, allowing them to adapt to unknown data scale distributions. We show that our method can learn the internal scales on various data scale distributions and can adapt the internal scales in current scale-equivariant approaches. ...
Conference paper (2023) - Yunqiang Li, Jan C. van Gemert, Torsten Hoefler, Bert Moons, Evangelos Eleftheriou, Bram-Ernst Verhoef
Deep learning algorithms are increasingly employed at the edge. However, edge devices are resource constrained and thus require efficient deployment of deep neural networks. Pruning methods are a key tool for edge deployment as they can improve storage, compute, memory bandwidth, and energy usage. In this paper we propose a novel accurate pruning technique that allows precise control over the output network size. Our method uses an efficient optimal transportation scheme which we make end-to-end differentiable and which automatically tunes the exploration-exploitation behavior of the algorithm to find accurate sparse sub-networks. We show that our method achieves state-of-the-art performance compared to previous pruning methods on 3 different datasets, using 5 different models, across a wide range of pruning ratios, and with two types of sparsity budgets and pruning granularities. ...
Journal article (2023) - Annabel M. Ruiter, Ziqi Wang, Zhao Yin, Willemijn C. Naber, Jerrel Simons, Jurre T. Blom, Jan C. van Gemert, Jan J.G.M. Verschuuren, Martijn R. Tannemaat
Objective: Myasthenia gravis (MG) is an autoimmune disease leading to fatigable muscle weakness. Extra-ocular and bulbar muscles are most commonly affected. We aimed to investigate whether facial weakness can be quantified automatically and used for diagnosis and disease monitoring. Methods: In this cross-sectional study, we analyzed video recordings of 70 MG patients and 69 healthy controls (HC) with two different methods. Facial weakness was first quantified with facial expression recognition software. Subsequently, a deep learning (DL) computer model was trained for the classification of diagnosis and disease severity using multiple cross-validations on videos of 50 patients and 50 controls. Results were validated using unseen videos of 20 MG patients and 19 HC. Results: Expression of anger (p = 0.026), fear (p = 0.003), and happiness (p < 0.001) was significantly decreased in MG compared to HC. Specific patterns of decreased facial movement were detectable in each emotion. Results of the DL model for diagnosis were as follows: area under the curve (AUC) of the receiver operator curve 0.75 (95% CI 0.65–0.85), sensitivity 0.76, specificity 0.76, and accuracy 76%. For disease severity: AUC 0.75 (95% CI 0.60–0.90), sensitivity 0.93, specificity 0.63, and accuracy 80%. Results of validation, diagnosis: AUC 0.82 (95% CI: 0.67–0.97), sensitivity 1.0, specificity 0.74, and accuracy 87%. For disease severity: AUC 0.88 (95% CI: 0.67–1.0), sensitivity 1.0, specificity 0.86, and accuracy 94%. Interpretation: Patterns of facial weakness can be detected with facial recognition software. Second, this study delivers a ‘proof of concept’ for a DL model that can distinguish MG from HC and classifies disease severity. ...

Video object detection by single-frame object location anticipation

Conference paper (2023) - Xin Liu, Jan C. van Gemert, Fatemeh Karimi Nejadasl, Olaf Booij, Silvia L. Pintea
Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by only doing the expensive feature computations on a small subset of all frames. Because neighboring video frames are often redundant, we only compute features for a single static keyframe and predict object locations in subsequent frames. 3) Reduced annotation cost, where we only annotate the keyframe and use smooth pseudo-motion between keyframes. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state-of-the-art on four datasets: ImageNet VID, EPIC KITCHENS-55, YouTube-BoundingBoxes and Waymo Open dataset. Our source code is available at https://github.com/L-KID/Video-object-detection-by-location-anticipation. ...
Conference paper (2023) - Cees Jol, Junhan Wen, Jan van Gemert
Strawberries are profitable fruits, yet they have a short shelf life. Therefore, it is crucial to anticipate their quality and harvest them at the best time, which is vital not only for finding the appropriate market but also for minimizing food and economic waste. To this end, non-destructive strawberry quality measurements are useful. Much research is conducted on post-harvest strawberries: the fruits were only analyzed after harvesting and thus, these methods cannot be used to find a good time to harvest. Our research targets pre-harvest analysis for supporting the timing decisions of harvests. As such, we used an infield image dataset that was collected during the cultivation of strawberries. The images are labeled by quality assessments and measurements from post-harvest destructive tests. We evaluated deep learning for quality estimation and trained our algorithms to predict the ripeness, firmness, and sweetness of strawberries. Additionally, we applied depth estimation algorithms and shape inpainting models to estimate the size of strawberries using images. Our results demonstrate the feasibility of infield quality attribute prediction. ...

Mixed methods for linking research in the humanities and in information technology (ArchiMediaL)

Book chapter (2023) - Tino Mager, Seyran Khademi, Ronald Siebes, Jan van Gemert, Victor de Boer, Beate Löffler, Carola Hein
Information on the history of architecture is embedded in our daily surroundings, in vernacular and heritage buildings and in physical objects, photographs and plans. Historians study these tangible and intangible artefacts and the communities that built and used them. Thus valuable insights are gained into the past and the present as they also provide a foundation for designing the future. Given that our understanding of the past is limited by the inadequate availability of data, the article demonstrates that advanced computer tools can help gain more and well-linked data from the past. Computer vision can make a decisive contribution to the identification of image content in historical photographs. This application is particularly interesting for architectural history, where visual sources play an essential role in understanding the built environment of the past, yet lack of reliable metadata often hinders the use of materials. The automated recognition contributes to making a variety of image sources usable for research. ...
Color is a crucial visual cue readily exploited by Convolutional Neural Networks (CNNs) for object recognition. However, CNNs struggle if there is data imbalance between color variations introduced by accidental recording conditions. Color invariance addresses this issue but does so at the cost of removing all color information, which sacrifices discriminative power. In this paper, we propose Color Equivariant Convolutions (CEConvs), a novel deep learning building block that enables shape feature sharing across the color spectrum while retaining important color information. We extend the notion of equivariance from geometric to photometric transformations by incorporating parameter sharing over hue-shifts in a neural network. We demonstrate the benefits of CEConvs in terms of downstream performance to various tasks and improved robustness to color changes, including train-test distribution shifts. Our approach can be seamlessly integrated into existing architectures, such as ResNets, and offers a promising solution for addressing color-based domain shifts in CNNs. ...
Conference paper (2023) - Frans de Boer, Jan C. van Gemert, Jouke Dijkstra, Silvia L. Pintea
Activity progress prediction aims to estimate what percentage of an activity has been completed. Currently, this is done with machine learning approaches, trained and evaluated on complicated and realistic video datasets. The videos in these datasets vary drastically in length and appearance, and some of the activities have unanticipated developments, making activity progression difficult to estimate. In this work, we examine the results obtained by existing progress prediction methods on these datasets. We find that current progress prediction methods seem not to extract useful visual information for the progress prediction task. Therefore, these methods fail to exceed simple frame-counting baselines. We design a precisely controlled dataset for activity progress prediction and on this synthetic dataset we show that the considered methods can make use of the visual information when this directly relates to the progress prediction. We conclude that the progress prediction task is ill-posed on the currently used real-world datasets. Moreover, to fairly measure activity progression, we advise considering a simple but effective frame-counting baseline. ...
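To make the idea of a frame-counting baseline in the abstract above concrete, here is a minimal sketch of one plausible form: progress is predicted from the frame index and the average training-video length alone, with no visual input. The function names and this particular formula are assumptions for illustration; the abstract does not specify which frame-counting baselines the authors evaluate.

```python
def fit_average_length(train_lengths):
    """Return the mean number of frames over the training videos."""
    return sum(train_lengths) / len(train_lengths)


def frame_counting_progress(num_frames, average_length):
    """Predict progress in [0, 1] for every frame from its index only."""
    return [min(1.0, (t + 1) / average_length) for t in range(num_frames)]


def mean_absolute_error(predicted, num_frames):
    """Compare predictions with linear ground-truth progress (t + 1) / T."""
    truth = [(t + 1) / num_frames for t in range(num_frames)]
    return sum(abs(p - g) for p, g in zip(predicted, truth)) / num_frames
```

A learned model that looks at the frames would have to score a lower error than this index-only predictor before it can be said to use visual information.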
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end. Training and testing current state-of-the-art deep learning models requires access to large amounts of data and computational power. However, gathering such data is challenging and computational resources might be limited. This work explores and measures how current deep temporal action localization models perform in settings constrained by the amount of data or computational power. We measure data efficiency by training each model on a subset of the training set. We find that TemporalMaxer outperforms other models in data-limited settings. Furthermore, we recommend TriDet when training time is limited. To test the efficiency of the models during inference, we pass videos of different lengths through each model. We find that TemporalMaxer requires the least computational resources, likely due to its simple architecture. ...

Learnable Activation Binarizer for Binary Neural Networks

Conference paper (2023) - Sieger Falkena, Hadi Jamali-Rad, Jan van Gemert
Binary Neural Networks (BNNs) are receiving an up-surge of attention for bringing power-hungry deep learning towards edge devices. The traditional wisdom in this space is to employ sign(.) for binarizing feature maps. We argue and illustrate that sign(.) is a uniqueness bottleneck, limiting information propagation throughout the network. To alleviate this, we propose to dispense with sign(.), replacing it with a learnable activation binarizer (LAB), allowing the network to learn a fine-grained binarization kernel per layer - as opposed to global thresholding. LAB is a novel universal module that can seamlessly be integrated into existing architectures. To confirm this, we plug it into four seminal BNNs and show a considerable accuracy boost at the cost of a tolerable increase in delay and complexity. Finally, we build an end-to-end BNN (coined as LAB-BNN) around LAB, and demonstrate that it achieves competitive performance on par with the state-of-the-art on ImageNet. Our code can be found in our repository: https://github.com/sfalkena/LAB. ...
Conference paper (2023) - Ombretta Strafforello, Klamer Schutte, Jan van Gemert
Many real-world applications, from sport analysis to surveillance, benefit from automatic long-term action recognition. In the current deep learning paradigm for automatic action recognition, it is imperative that models are trained and tested on datasets and tasks that evaluate if such models actually learn and reason over long-term information. In this work, we propose a method to evaluate how suitable a video dataset is to evaluate models for long-term action recognition. To this end, we define a long-term action as excluding all the videos that can be correctly recognized using solely short-term information. We test this definition on existing long-term classification tasks on three popular real-world datasets, namely Breakfast, CrossTask and LVU, to determine if these datasets are truly evaluating long-term recognition. Our study reveals that these datasets can be effectively solved using shortcuts based on short-term information. Following this finding, we encourage long-term action recognition researchers to make use of datasets that need long-term information to be solved. ...
Conference paper (2023) - Silvia L. Pintea, Yancong Lin, Jouke Dijkstra, Jan C. van Gemert
A number of computer vision deep regression approaches report improved results when adding a classification loss to the regression loss. Here, we explore why this is useful in practice and when it is beneficial. To do so, we start from precisely controlled dataset variations and data samplings and find that the effect of adding a classification loss is the most pronounced for regression with imbalanced data. We explain these empirical findings by formalizing the relation between the balanced and imbalanced regression losses. Finally, we show that our findings hold on two real imbalanced image datasets for depth estimation (NYUD2-DIR), and age estimation (IMDB-WIKI-DIR), and on the problem of imbalanced video progress prediction (Breakfast). Our main takeaway is: for a regression task, if the data sampling is imbalanced, then add a classification loss. ...
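The takeaway of the abstract above, adding a classification loss to a regression loss, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the continuous targets are discretized with hypothetical `bin_edges`, the two terms are combined with a hypothetical `weight`, and the model is assumed to output both a regressed value and per-bin logits. The abstract does not give the authors' exact binning or weighting.

```python
import numpy as np


def combined_loss(pred_value, class_logits, target, bin_edges, weight=1.0):
    """Mean squared error plus weighted cross-entropy over binned targets.

    pred_value:   (N,) regressed values
    class_logits: (N, K) logits, where K = len(bin_edges) + 1
    target:       (N,) continuous ground-truth values
    bin_edges:    increasing interior edges that split the target range
    """
    pred_value = np.asarray(pred_value, dtype=np.float64)
    class_logits = np.asarray(class_logits, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Regression term.
    mse = np.mean((pred_value - target) ** 2)

    # Discretize the continuous targets into class labels 0..K-1.
    labels = np.digitize(target, bin_edges)

    # Numerically stable log-softmax followed by cross-entropy.
    shifted = class_logits - class_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    cross_entropy = -np.mean(log_probs[np.arange(len(labels)), labels])

    return mse + weight * cross_entropy
```

Setting `weight=0.0` recovers the plain regression loss, which makes it easy to compare training with and without the classification term on the same data sampling.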

A Visually-Guided Graph Edit Distance for Floor Plan Similarity

We propose a simple yet effective metric that measures structural similarity between visual instances of architectural floor plans, without the need for learning. Qualitatively, our experiments show that the retrieval results are similar to deeply learned methods. Effectively comparing instances of floor plan data is paramount to the success of machine understanding of floor plan data, including the assessment of floor plan generative models and floor plan recommendation systems. Comparing visual floor plan images goes beyond a sole pixel-wise visual examination and is crucially about similarities and differences in the shapes and relations between subdivisions that compose the layout. Currently, deep metric learning approaches are used to learn a pair-wise vector representation space that closely mimics the structural similarity, in which the models are trained on similarity labels that are obtained by Intersection-over-Union (IoU). To compensate for the lack of structural awareness in IoU, graph-based approaches such as Graph Matching Networks (GMNs) are used, which require pairwise inference for comparing data instances, making GMNs less practical for retrieval applications. In this paper, an effective evaluation metric for judging the structural similarity of floor plans, coined SSIG (Structural Similarity by IoU and GED), is proposed based on both image and graph distances. In addition, an efficient algorithm is developed that uses SSIG to rank a large-scale floor plan database. Code will be openly available. ...

Short temporal receptive fields increase robustness in long-term action recognition

Conference paper (2023) - Ombretta Strafforello, Xin Liu, Klamer Schutte, Jan van Gemert
Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which causes a performance decrease when testing videos have a different sub-action order. In this work, we investigate whether we can improve the model robustness to the sub-action order by shrinking the temporal receptive field of action recognition models. For this, we design Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that short receptive fields are robust to sub-action order changes, while larger temporal receptive fields are sensitive to the sub-action order. ...
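For readers unfamiliar with how a temporal receptive field is determined, the sketch below applies the standard receptive-field recurrence for a stack of convolutions along the time axis. The layer lists in the usage lines are hypothetical and are not the Video BagNet architecture, which the abstract does not detail.

```python
def temporal_receptive_field(layers):
    """Return the receptive field, in frames, of stacked temporal convolutions.

    layers: list of (kernel_size, stride) pairs along the time axis,
            ordered from input to output.
    """
    receptive_field, jump = 1, 1
    for kernel_size, stride in layers:
        # Each layer widens the field by (kernel_size - 1) input-level steps.
        receptive_field += (kernel_size - 1) * jump
        # Strides increase the spacing between neighboring output positions.
        jump *= stride
    return receptive_field


# Hypothetical stacks: kernel size 1 in time never looks beyond one frame,
# while four stride-1 layers with kernel size 3 cover nine frames.
single_frame = temporal_receptive_field([(1, 1)] * 4)
nine_frames = temporal_receptive_field([(3, 1)] * 4)
```

Replacing temporal kernels of size 3 with size 1 in some layers is one simple way to limit the field, since those layers then contribute nothing to the sum.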