A.S. Gielisse
Please Note
This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.
10 records found
Understanding the Value of Depth: RGB-D Fusion and Pseudo-Depth for Robust Out-of-Distribution Generalisation
An Experimental Journey into How Depth Shapes Generalisation in Vision Models
Convolutional neural networks (CNNs) trained on RGB images (red, green, blue channels) often exhibit sharp performance degradation under distribution shifts, as they tend to rely on superficial appearance cues such as background or texture. While depth information is known to provide complementary geometric signals that can improve robustness, most existing approaches assume access to ground-truth depth or rely on complex RGB-D architectures, limiting their applicability in practice.
In this work, we investigate whether estimated depth, obtained from a monocular RGB image, can serve as a simple and effective auxiliary signal to improve out-of-distribution (OOD) generalisation in standard CNN classifiers. Using both controlled toy experiments and real-world evaluations on the NICO++ benchmark, we compare RGB-only models against RGB-D variants that incorporate a single predicted depth channel via minimal fusion. Our results show that pseudo-depth consistently reduces OOD performance gaps across multiple CNN backbones, without degrading in-distribution accuracy. We further demonstrate that these gains persist under moderate corruption of the depth signal and disappear when geometric structure is entirely removed, indicating that the improvements stem from meaningful geometric information rather than the mere presence of an additional input channel. Furthermore, we analyse these effects through class-resolved confusion matrices and qualitative input-level examples, showing that depth specifically attenuates structured semantic confusions under domain shift.
Taken together, our findings suggest that even imperfect, predicted depth can act as a lightweight geometric inductive bias, helping CNN classifiers move away from brittle appearance-based shortcuts and toward more robust representations under domain shift.
https://gitlab.ewi.tudelft.nl/in5000/janvangemert/alexandraioana
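The "minimal fusion" of a single predicted depth channel described in the abstract above can be pictured as appending one extra value to every RGB pixel. The sketch below is only an illustration under stated assumptions: the function names (`normalise_depth`, `fuse_rgbd`, `shuffle_depth`), the min-max normalisation, and the use of shuffling as the structure-free control are all hypothetical choices, not the authors' implementation.

```python
import random

def normalise_depth(depth):
    """Min-max scale a 2-D depth map (list of rows) to the range [0, 1]."""
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant depth map carries no geometric information.
        return [[0.0 for _ in row] for row in depth]
    return [[(d - lo) / span for d in row] for row in depth]

def fuse_rgbd(rgb, depth):
    """Early fusion (assumed): append one depth value to each (r, g, b) pixel."""
    depth01 = normalise_depth(depth)
    return [
        [(*pixel, d) for pixel, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth01)
    ]

def shuffle_depth(depth, seed=0):
    """Control input (assumed): keep the depth histogram, destroy the layout."""
    flat = [d for row in depth for d in row]
    random.Random(seed).shuffle(flat)
    width = len(depth[0])
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# Toy 2x2 image with a hypothetical predicted depth map.
rgb = [[(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)],
       [(0.7, 0.8, 0.9), (1.0, 0.0, 0.5)]]
pseudo_depth = [[2.0, 4.0], [6.0, 10.0]]

fused = fuse_rgbd(rgb, pseudo_depth)                            # 4 channels per pixel
control = fuse_rgbd(rgb, shuffle_depth(pseudo_depth))           # same values, no geometry
```

Comparing a classifier fed `fused` with one fed `control` is one way to separate the effect of geometric structure from the effect of merely having a fourth input channel.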
Much progress in optical flow research has been driven by benchmark datasets. However, these datasets provide only limited feedback on the underlying causes of architectural failures, typically restricted to metrics such as end-point error (EPE), occlusion statistics, and large-displacement ranges. As a result, claims about where successive models have actually improved tend to be imprecise. In this paper, we present an analysis tool that enables the generation of customisable datasets, allowing controlled variation in displacement size, camera corruptions, luminance, and other factors. We demonstrate the utility of this tool by analysing the behaviour of different architectures under varying displacement sizes and in low-light settings.
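For readers unfamiliar with the metric, end-point error is the mean Euclidean distance between predicted and ground-truth flow vectors. The sketch below also groups the error by ground-truth displacement size, which is one plausible way to study behaviour across displacement ranges. The bucket edges (0, 10, 40 pixels) and both function names are assumptions made for illustration and are not taken from the tool described above.

```python
import math

def end_point_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    total = sum(math.hypot(pu - gu, pv - gv)
                for (pu, pv), (gu, gv) in zip(pred, gt))
    return total / len(pred)

def epe_by_displacement(pred, gt, edges=(0.0, 10.0, 40.0)):
    """EPE per ground-truth displacement bucket; keys are the lower bucket edges."""
    buckets = {edge: [] for edge in edges}
    for (pu, pv), (gu, gv) in zip(pred, gt):
        magnitude = math.hypot(gu, gv)
        # Largest lower edge that does not exceed the ground-truth magnitude.
        lower = max(e for e in edges if e <= magnitude)
        buckets[lower].append(math.hypot(pu - gu, pv - gv))
    return {e: (sum(v) / len(v) if v else None) for e, v in buckets.items()}

# Three pixels with small, medium, and large ground-truth motion.
gt = [(3.0, 4.0), (30.0, 0.0), (0.0, 50.0)]
pred = [(3.0, 4.0), (33.0, 4.0), (0.0, 45.0)]

overall = end_point_error(pred, gt)
per_bucket = epe_by_displacement(pred, gt)
```

A single overall EPE hides the fact that, in this toy case, all of the error comes from the medium and large displacement buckets.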
Optical flow models excel on synthetic benchmarks but can struggle with real-world scenarios involving large displacements, which are critical for applications like autonomous navigation and augmented reality. To address this, we introduce a novel real-world dataset and evaluation framework, using a specialized annotation tool to capture ground truth optical flow in scenarios with fast movements and close-range objects. Our approach minimizes confounders, providing clear insights into model performance with large displacements. Findings show recent models outperform the previous state-of-the-art, RAFT, across all tested scenarios. Both the annotation tool and dataset are available to support further research.
Going Against The Flow
Evaluating Optical Flow Estimation Models on Real-World Non-Rigid Motion
Optical flow estimation models are currently trained and evaluated on synthetic datasets. However, the generalizability of these models to real-world applications remains unexplored. This study investigates how well two state-of-the-art optical flow estimation models perform on real-world Articulated, Homothetic, and Conformal non-rigid motion. To facilitate evaluation, a manually annotated dataset comprising twenty-four real-world image pairs and sparse vector fields was created. Both models demonstrated performance consistent with synthetic benchmarks on Homothetic and Conformal motion. However, results degraded on Articulated motion, revealing limitations for practical applications such as controlled robotics and object tracking.
Optical flow estimation is a core task in computer vision, yet many existing models struggle with lighting-induced appearance changes that are common in real-world scenarios. This work presents a focused evaluation of recent deep learning-based optical flow models under controlled lighting variations, using a custom dataset composed of indoor and outdoor scenes recorded with a static camera. Scenarios include glare, moving shadows, intensity shifts, and outdoor shadows, with ground truth flow defined as zero to isolate the effect of illumination changes. Four models—RAFT, GMFlow, SEA-RAFT, and FlowDiffuser—are benchmarked using standard metrics (EPE and F1-all). The results reveal that even in the absence of physical motion, several models produce significant flow estimates, particularly under shadow and intensity variation. SEA-RAFT and RAFT show relatively higher robustness, while GMFlow and FlowDiffuser are more sensitive to lighting artifacts. The findings highlight a critical gap in current model generalization and emphasize the need for lighting-aware architectures and training strategies.
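When the ground-truth flow is defined as zero, as in the abstract above, the end-point error of a pixel is simply the magnitude of its predicted flow vector. The sketch below illustrates this. It reads "F1-all" as a KITTI-style outlier percentage, which is an assumption: with zero ground truth the relative part of that criterion is vacuous, so only an absolute pixel threshold remains. The 3-pixel threshold and the function name are likewise assumptions, not details from the study.

```python
import math

def static_scene_metrics(pred_flow, px_threshold=3.0):
    """Return (EPE, outlier percentage) against an all-zero ground-truth flow."""
    if not pred_flow:
        raise ValueError("pred_flow must not be empty")
    # With zero ground truth, the per-pixel error equals the predicted magnitude.
    magnitudes = [math.hypot(u, v) for u, v in pred_flow]
    epe = sum(magnitudes) / len(magnitudes)
    outliers = sum(1 for m in magnitudes if m > px_threshold)
    return epe, 100.0 * outliers / len(magnitudes)

# Hypothetical predictions for four pixels of a static scene under a moving shadow.
pred_flow = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0), (-1.0, 0.0)]
epe, outlier_pct = static_scene_metrics(pred_flow)
```

Any non-zero value of either metric in this setting is spurious motion attributable to the appearance change rather than to physical movement.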
Occlusions are one of the main challenges in optical flow estimation, where parts of the scene are no longer visible between consecutive frames. Several models address this problem, either intrinsically or explicitly, using different strategies. However, most benchmarks rely on synthetic data, and even real-world ones evaluate only overall model performance, without isolating occlusions. This work investigates optical flow model performance under real-world occlusions by introducing a manually annotated, occlusion-focused dataset. We present an annotation method tailored to three occlusion types: out-of-frame, inter-object, and self-occlusion. We then evaluate two models, FlowFormer++ and CCMR, which handle occlusions using different mechanisms. Our findings show that while CCMR demonstrates stronger overall performance, both models struggle with occluded regions, particularly self-occlusions involving rotation and perspective transformations. These results highlight the need for improved occlusion reasoning in models and more diverse real-world benchmarks.
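Of the three occlusion types named above, out-of-frame occlusion is the easiest to express in code: a pixel is occluded in this sense when its ground-truth flow carries it outside the image bounds. The sketch below is a generic illustration of that definition only. It is not the annotation method introduced in the work, and the function name and data layout (`flow[y][x] = (u, v)`) are assumptions.

```python
def out_of_frame_mask(flow):
    """Mark pixels whose flow target lies outside the image.

    flow[y][x] is the (u, v) displacement of pixel (x, y) in pixels.
    Returns a same-shaped grid of booleans (True = leaves the frame).
    """
    height = len(flow)
    width = len(flow[0])
    mask = []
    for y, row in enumerate(flow):
        mask_row = []
        for x, (u, v) in enumerate(row):
            target_x, target_y = x + u, y + v
            inside = 0 <= target_x <= width - 1 and 0 <= target_y <= height - 1
            mask_row.append(not inside)
        mask.append(mask_row)
    return mask

# Toy 2x3 flow field: some pixels move past the right, left, or bottom border.
flow = [[(0, 0), (1, 0), (1, 0)],
        [(-1, 0), (0, 1), (0, -1)]]
mask = out_of_frame_mask(flow)
```

Inter-object and self-occlusion cannot be derived from the flow field alone in this way, which is consistent with the need for manual annotation noted above.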
Representing CNN Feature Maps with Implicit Neural Representations
A Proof-of-Concept Study Using SIRENs
High-resolution image analysis using deep Convolutional Neural Networks (CNNs) faces significant memory constraints due to the quadratic growth of intermediate feature maps with input resolution. This paper investigates whether Implicit Neural Representations (INRs), specifically SIRENs, can effectively represent CNN feature maps to reduce memory footprint during training. We address the unique challenge that CNN feature maps are not static signals but evolve continuously as network weights are updated through gradient-based optimization. Through three experiments on a modified All-CNN architecture trained on MNIST, we validate that: (1) SIRENs can fit static feature maps from frozen CNNs with high fidelity (PSNR > 30 dB) regardless of weight initialization; (2) SIRENs can track evolving feature maps during training, though with reduced reconstruction quality compared to static targets; and (3) SIREN-assisted feedforward—where SIRENs predict missing activations in receptive fields—enables classification accuracy (20.97%) above random guessing (10%) but substantially below standard training (95%). While results demonstrate the feasibility of using SIRENs to represent dynamic feature maps, significant challenges remain in maintaining reconstruction fidelity when SIRENs are integrated into the training loop. This proof-of-concept study provides empirical insights into bridging continuous implicit representations with discrete deep learning pipelines and highlights promising directions for future research in memory-efficient high-resolution image analysis.
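The fidelity threshold quoted above (PSNR > 30 dB) uses the standard peak signal-to-noise ratio, 10·log10(peak² / MSE). The sketch below computes it for a 2-D feature map. Because feature-map activations have no fixed range, the peak value is taken here as the dynamic range of the target map. That choice, along with the function and variable names, is an assumption and may differ from the convention used in the study.

```python
import math

def psnr(target, reconstruction, peak=None):
    """Peak signal-to-noise ratio in dB between two equally shaped 2-D maps."""
    flat_t = [v for row in target for v in row]
    flat_r = [v for row in reconstruction for v in row]
    if len(flat_t) != len(flat_r) or not flat_t:
        raise ValueError("maps must be non-empty and of equal size")
    mse = sum((t - r) ** 2 for t, r in zip(flat_t, flat_r)) / len(flat_t)
    if mse == 0.0:
        return math.inf  # perfect reconstruction
    if peak is None:
        # Assumed convention: use the dynamic range of the target map.
        peak = max(flat_t) - min(flat_t)
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 10.0 * math.log10(peak ** 2 / mse)

# Hypothetical feature map and a reconstruction that is off by 0.01 everywhere.
feature_map = [[0.0, 1.0], [0.5, 0.25]]
reconstruction = [[0.01, 1.01], [0.51, 0.26]]
quality_db = psnr(feature_map, reconstruction)  # roughly 40 dB
```

Under this convention, a uniform error of about 3% of the dynamic range corresponds to roughly 30 dB.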
It is commonly believed that image recognition based on RGB improves when using RGB-D, i.e., when depth information (distance from the camera) is added. Adding depth should make models more robust to appearance variations in color and lighting, help them recognize shape and spatial relationships, and allow them to ignore irrelevant backgrounds. In this paper we investigate how robust current RGB-D models truly are to changes in appearance, depth, and background by varying one modality (RGB or depth) and comparing RGB-D to RGB-only and depth-only models in a semantic segmentation setting. Experiments show that all investigated RGB-D models are somewhat robust to variations in color, but can fail severely under unseen variations in lighting, spatial position, and backgrounds. Our results show that we need new RGB-D models that can exploit the best of both modalities while remaining robust to changes in a single modality.
Implicit neural representations (INRs) exhibit exceptional compression and generalisation abilities that have enabled striking progress across a variety of applications. These properties have fuelled a growing interest in leveraging INRs for traditional classification tasks as a memory-efficient alternative representation of images, breaking the persistent link between image resolution and associated resource costs. Current INR classification methods face limitations such as a restriction to low-resolution data and sensitivity to image-space transformations. We attribute these issues to the employed INR architecture, which lacks mechanisms for local representation and therefore disregards spatial structure within the data and struggles to capture high-frequency details. In this work, we propose ARC: Anchored Representation Clouds, a novel INR architecture that explicitly anchors latent vectors in image-space. By introducing spatial structure to the latent vectors, ARC can capture local image data, which in our testing leads to state-of-the-art implicit image classification of both low- and high-resolution images and increased robustness against image-space translation.