Holger Caesar | TU Delft Repository

Enhancing the dependability of autonomous surface vehicles through robustness benchmarking of real-time object detection models

Journal article (2026) - Yunjia Wang, Zihao Zhang, Kaizheng Wang, Holger Caesar, Jeroen Boydens, Davy Pissoort, Mathias Verbeke

The Autonomous Surface Vehicle (ASV) market is expected to double by 2030, rapidly transforming maritime logistics through faster deliveries, lower costs, reduced risks from human error, and the potential to save human lives. ASVs depend on robust object detection models to ensure safe navigation. However, existing models are often susceptible to natural corruptions such as blur, noise, adverse weather, and occlusions. These risks to perception robustness are further intensified by the lack of domain-specific robustness benchmarks. To fill this gap, we propose the first waterborne-focused robustness benchmark, incorporating 25 synthetic corruptions (15 adapted from ImageNet-C plus 10 novel ones for ASVs) across five severity levels. We also incorporate mixed corruptions to capture real-world complexity. Building on three public waterborne datasets (SeaShips, SMD, SSAVE), we create SeaShips-C, SMD-C, and SSAVE-C, each augmented with our corruption suite. A comprehensive robustness evaluation is conducted on multiple sizes of YOLOv8, SSD, NanoDet-Plus, and RT-DETR, revealing critical vulnerabilities: e.g., YOLOv8n's mAP⁵⁰ drops by 43.0% under contrast corruption on SeaShips-C, reaching a 59.5% decline when combined with raindrops. Larger variants (e.g., YOLOv8x) exhibit greater robustness, offering insights for safer deployments. Aligned with ISO/IEC TR 5469 and IEC 61508, our benchmark supports pre-deployment verification. By identifying risk-prone conditions, practitioners can apply targeted mitigation strategies, such as data augmentation and human oversight. To promote further research and support industrial practice, we provide open access to all benchmark datasets and code, which can also serve as a data augmentation resource to enhance model training. ...
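
As a rough illustration of what a severity-levelled corruption looks like, the following minimal sketch applies Gaussian noise at five levels to a toy grayscale image and computes a relative score drop. The noise levels in `SEVERITY_SIGMA`, the function names, and the choice to measure the drop relative to the clean score are assumptions made for this example; they are not taken from the benchmark.

```python
import random

# Hypothetical noise standard deviations for severity levels 1-5
# (illustrative values, not the benchmark's actual parameters).
SEVERITY_SIGMA = {1: 0.04, 2: 0.08, 3: 0.12, 4: 0.18, 5: 0.26}


def gaussian_noise(image, severity, seed=0):
    """Add zero-mean Gaussian noise to an image with pixel values in [0, 1].

    `image` is a list of rows, each a list of floats.
    """
    if severity not in SEVERITY_SIGMA:
        raise ValueError("severity must be an integer from 1 to 5")
    rng = random.Random(seed)
    sigma = SEVERITY_SIGMA[severity]
    # Perturb every pixel, then clip back into the valid range.
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


def relative_drop(clean_score, corrupted_score):
    """Performance drop in percent, relative to the clean score (an assumption)."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return 100.0 * (clean_score - corrupted_score) / clean_score


# Toy 4x4 mid-gray image, corrupted once per severity level.
image = [[0.5] * 4 for _ in range(4)]
corrupted = {level: gaussian_noise(image, level) for level in range(1, 6)}
```

A detector would then be evaluated on each entry of `corrupted`, and `relative_drop` compares its score against the clean-image score.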

Advancing High-Resolution and Efficient Automotive Radar Imaging through Domain-Informed 1D Deep Learning

Conference paper (2025) - Ruxin Zheng, Shunqiao Sun, Hongshan Liu, Holger Caesar, Honglei Chen, Jian Li

Millimeter-wave (mmWave) radars are critical for autonomous vehicles' perception tasks, offering reliable performance in adverse weather conditions. However, their application is often hindered by insufficient spatial resolution for detailed semantic scene interpretation. Traditional super-resolution methods derived from optical imaging fail to accommodate the unique properties of radar signals. Addressing this, our study redefines radar imaging super-resolution as a one-dimensional (1D) signal super-resolution spectra estimation problem, leveraging domain-specific insights to innovate data normalization and introduce a domain-informed signal-to-noise ratio (SNR)-guided loss function. Our custom deep learning network, tailored for automotive radar imaging, achieves substantial improvements in parameter efficiency and inference speed while enhancing image quality and resolution. Comprehensive tests demonstrate that our SR-SPECNet establishes a new standard for high-resolution radar range-azimuth imaging, surpassing previous methods. Source code and new radar dataset will be made publicly available at https://github.com/ruxinzh/SR DOA. ...

NeuroNCAP

Photorealistic Closed-Loop Safety Testing for Autonomous Driving

Conference paper (2025) - William Ljungbergh, Adam Tonderski, Joakim Johnander, Holger Caesar, Kalle Åström, Michael Felsberg, Christoffer Petersson

We present a versatile NeRF-based simulator for testing autonomous driving (AD) software systems, designed with a focus on sensor-realistic closed-loop evaluation and the creation of safety-critical scenarios. The simulator learns from sequences of real-world driving sensor data and enables reconfigurations and renderings of new, unseen scenarios. In this work, we use our simulator to test the responses of AD models to safety-critical scenarios inspired by the European New Car Assessment Programme (Euro NCAP). Our evaluation reveals that, while state-of-the-art end-to-end planners excel in nominal driving scenarios in an open-loop setting, they exhibit critical flaws when navigating our safety-critical scenarios in a closed-loop setting. This highlights the need for advancements in the safety and real-world usability of end-to-end planners. By publicly releasing our simulator and scenarios as an easy-to-run evaluation suite, we invite the research community to explore, refine, and validate their AD models in controlled, yet highly configurable and challenging sensor-realistic environments. ...

BikeScenes: Online LiDAR Semantic Segmentation for Bicycles

Preprint (2025) - Holger Caesar, D. Goren

The vulnerability of cyclists, exacerbated by the rising popularity of faster e-bikes, motivates adapting automotive perception technologies for bicycle safety. We use our multi-sensor 'SenseBike' research platform to develop and evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the automotive-to-bicycle domain gap, we introduce the novel BikeScenes-lidarseg Dataset, comprising 3021 consecutive LiDAR scans around the university campus of the TU Delft, semantically annotated for 29 dynamic and static classes. By evaluating model performance, we demonstrate that fine-tuning on our BikeScenes dataset achieves a mean Intersection-over-Union (mIoU) of 63.6%, significantly outperforming the 13.8% obtained with SemanticKITTI pre-training alone. This result underscores the necessity and effectiveness of domain-specific training. We highlight key challenges specific to bicycle-mounted, hardware-constrained perception systems and contribute the BikeScenes dataset as a resource for advancing research in cyclist-centric LiDAR segmentation. ...
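
For readers unfamiliar with the metric, mean Intersection-over-Union can be computed from a per-class confusion matrix. The sketch below is a generic illustration with made-up counts, not the evaluation code used for BikeScenes. Skipping classes that appear in neither the ground truth nor the predictions is one common convention and is an assumption here.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] counts points with ground-truth class i predicted as class j.
    Classes that occur in neither ground truth nor predictions are skipped.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp  # class-c points predicted as something else
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp  # other points predicted as c
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    if not ious:
        raise ValueError("confusion matrix contains no samples")
    return sum(ious) / len(ious)


# Made-up counts for three classes; the third class never occurs.
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 0],
]
print(f"mIoU = {mean_iou(confusion):.3f}")
```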

4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation

Preprint (2025) - J.C.K. Kwok, Holger Caesar, A. Palffy

Automotive radar has shown promising developments in environment perception due to its cost-effectiveness and robustness in adverse weather conditions. However, the limited availability of annotated radar data poses a significant challenge for advancing radar-based perception systems. To address this limitation, we propose a novel framework to generate 4D radar point clouds for training and evaluating object detectors. Unlike image-based diffusion, our method is designed to consider the sparsity and unique characteristics of radar point clouds by applying diffusion to a latent point cloud representation. Within this latent space, generation is controlled via conditioning at either the object or scene level. The proposed 4D-RaDiff converts unlabeled bounding boxes into high-quality radar annotations and transforms existing LiDAR point cloud data into realistic radar scenes. Experiments demonstrate that incorporating synthetic radar data from 4D-RaDiff as a data augmentation method during training consistently improves object detection performance compared to training on real data only. In addition, pre-training on our synthetic data reduces the amount of required annotated radar data by up to 90% while achieving comparable object detection performance. ...

ECCV 2024 W-CODA

1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving

Preprint (2025) - Kai Chen, Ruiyuan Gao, Lanqing Hong, Hang Xu, Jia Xu, Holger Caesar, More Authors...

In this paper, we present details of the 1st W-CODA workshop, held in conjunction with ECCV 2024. W-CODA aims to explore next-generation solutions for autonomous driving corner cases, empowered by state-of-the-art multimodal perception and comprehension techniques. Five speakers from both academia and industry are invited to share their latest progress and opinions. We collect research papers and hold a dual-track challenge, including both corner case scene understanding and generation. As the pioneering effort, we will continuously bridge the gap between frontier autonomous driving techniques and fully intelligent, reliable self-driving agents robust towards corner cases. ...

VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow

Conference paper (2025) - Y. Lin, S. Wang, L. Nan, J.F.P. Kooij, Holger Caesar

Scene flow estimation aims to recover per-point motion from two adjacent LiDAR scans. However, in real-world applications such as autonomous driving, points rarely move independently of others, especially for nearby points belonging to the same object, which often share the same motion. Incorporating this locally rigid motion constraint has been a key challenge in self-supervised scene flow estimation, which is often addressed by post-processing or appending extra regularization. While these approaches are able to improve the rigidity of predicted flows, they lack an architectural inductive bias for local rigidity within the model structure, leading to suboptimal learning efficiency and inferior performance. In contrast, we enforce local rigidity with a lightweight add-on module in neural network design, enabling end-to-end learning. We design a discretized voting space that accommodates all possible translations and then identify the one shared by nearby points by differentiable voting. Additionally, to ensure computational efficiency, we operate on pillars rather than points and learn representative features for voting per pillar. We plug the Voting Module into popular model designs and evaluate its benefit on Argoverse 2 and Waymo datasets. We outperform baseline works with only marginal compute overhead. Code is available at https://github.com/tudelft-iv/VoteFlow. ...
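
The idea of voting over a discretized translation space can be sketched as follows. This is a simplified stand-in, not the VoteFlow module: the abstract does not specify the voting mechanism, so the example assumes that neighbouring pillars each score every candidate translation, that the scores are summed, and that a softmax-weighted mean (soft-argmax) selects the shared translation. All names and values are hypothetical.

```python
import math


def soft_vote(scores_per_pillar, candidates, temperature=1.0):
    """Pick a shared 2D translation for neighbouring pillars by soft voting.

    scores_per_pillar: list of score lists, one score per candidate translation.
    candidates: list of (dx, dy) translations forming the discretized voting space.
    Returns the softmax-weighted mean translation (a soft-argmax).
    """
    if not scores_per_pillar:
        raise ValueError("need at least one pillar")
    # Accumulate votes: sum the scores of all neighbouring pillars per candidate.
    votes = [sum(s[k] for s in scores_per_pillar) for k in range(len(candidates))]
    # Numerically stable softmax over the accumulated votes.
    peak = max(votes)
    exps = [math.exp((v - peak) / temperature) for v in votes]
    total = sum(exps)
    probs = [e / total for e in exps]
    dx = sum(p * c[0] for p, c in zip(probs, candidates))
    dy = sum(p * c[1] for p, c in zip(probs, candidates))
    return dx, dy


# Three candidate translations along x; both pillars favour the last one.
candidates = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
scores = [[0.0, 1.0, 8.0], [0.0, 2.0, 9.0]]
shared_translation = soft_vote(scores, candidates)
```

Because the function is built only from sums, exponentials, and divisions, the same computation written in an automatic-differentiation framework would pass gradients back to the per-pillar scores.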

OpenPSG

Open-Set Panoptic Scene Graph Generation via Large Multimodal Models

Conference paper (2025) - Zijian Zhou, Zheng Zhu, Holger Caesar, Miaojing Shi

Panoptic Scene Graph Generation (PSG) aims to segment objects and recognize their relations, enabling the structured understanding of an image. Previous methods focus on predicting predefined object and relation categories, hence limiting their applications in open-world scenarios. With the rapid development of large multimodal models (LMMs), significant progress has been made in open-set object detection and segmentation, yet open-set relation prediction in PSG remains unexplored. In this paper, we focus on the task of open-set relation prediction integrated with a pretrained open-set panoptic segmentation model to achieve true open-set panoptic scene graph generation (OpenPSG). Our OpenPSG leverages LMMs to achieve open-set relation prediction in an autoregressive manner. We introduce a relation query transformer to efficiently extract visual features of object pairs and estimate the existence of relations between them. The latter can enhance the prediction efficiency by filtering irrelevant pairs. Finally, we design the generation and judgement instructions to perform open-set relation prediction in PSG autoregressively. To our knowledge, we are the first to propose the open-set PSG task. Extensive experiments demonstrate that our method achieves state-of-the-art performance in open-set relation prediction and panoptic scene graph generation. ...

Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation

Preprint (2025) - C. Chang, Chen-Yu Wang, Julian Schmidt, Holger Caesar, Alain Pagani

Recent advancements in video generation have substantially improved visual quality and temporal coherence, making these models increasingly appealing for applications such as autonomous driving, particularly in the context of driving simulation and so-called "world models". In this work, we investigate the effects of existing fine-tuning video generation approaches on structured driving datasets and uncover a potential trade-off: although visual fidelity improves, spatial accuracy in modeling dynamic elements may degrade. We attribute this degradation to a shift in the alignment between visual quality and dynamic understanding objectives. In datasets with diverse scene structures within temporal space, where objects or perspective shift in varied ways, these objectives tend to be highly correlated. However, the very regular and repetitive nature of driving scenes allows visual quality to improve by modeling dominant scene motion patterns, without necessarily preserving fine-grained dynamic behavior. As a result, fine-tuning encourages the model to prioritize surface-level realism over dynamic accuracy. To further examine this phenomenon, we show that simple continual learning strategies, such as replay from diverse domains, can offer a balanced alternative by preserving spatial accuracy while maintaining strong visual quality. ...

A Vehicle System for Navigating Among Vulnerable Road Users Including Remote Operation

Conference paper (2025) - O. De Groot, A. Bertipaglia, F. Tajdari, S. Wang, Z. Xia, M. Zaffar, R. Ensing, M. Garzon, J. Alonso-Mora, H. Caesar, L. Ferranti, R. Happee, H. Boekema, J. F.P. Kooij, G. Papaioannou, B. Shyrokau, D. M. Gavrila, V. Jain, M. Kegl, V. Kotian, T. Lentsch, Y. Lin, C. Messiou, E. Schippers

We present a vehicle system capable of navigating safely and efficiently around Vulnerable Road Users (VRUs), such as pedestrians and cyclists. The system comprises key modules for environment perception, localization and mapping, motion planning, and control, integrated into a prototype vehicle. A key innovation is a motion planner based on Topology-driven Model Predictive Control (T-MPC). The guidance layer generates multiple trajectories in parallel, each representing a distinct strategy for obstacle avoidance or non-passing. The underlying trajectory optimization constrains the joint probability of collision with VRUs under generic uncertainties. To address extraordinary situations ('edge cases') that go beyond the autonomous capabilities - such as construction zones or encounters with emergency responders - the system includes an option for remote human operation, supported by visual and haptic guidance. In simulation, our motion planner outperforms three baseline approaches in terms of safety and efficiency. We also demonstrate the full system in prototype vehicle tests on a closed track, both in autonomous and remotely operated modes. ...

Mobility Futures

Four scenarios for the Dutch mobility system in 2050

Book (2025) - B. Atasoy, Holger Caesar, Wijnand Veeneman, Roelof Vos, S.P. Hoogendoorn, S. Hoogendoorn-Lanser, Deborah Nas, N. van Nes, K. Spoor, M.P. Swarte, Joost Ellerbroek, S. Hiemstra-van Mastrigt, M.Y. Maknoon, N. van Oort, A. Psyllidis, S.C. van der Spek, M. Snelder, M. Triggianese

Mobility is vital for societal wellbeing, economic growth, social inclusion, and access to essential amenities. However, the current system faces significant challenges, including environmental impact, unequal access, and safety concerns. […] ...

ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation

Preprint (2025) - Simon Boeder, Fabian Gigengack, Simon Roesler, Holger Caesar, Benjamin Risse

Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding. ...

Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin

Preprint (2025) - A.K.G.H. Huynh, João Malheiro Silva, Holger Caesar, Tong Duy Son

3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics. ...

nuScenes Revisited: Progress and Challenges in Autonomous Driving

Preprint (2025) - Whye Kit Fong, Venice Erin Liong, Kok Seang Tan, Holger Caesar

Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization & mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes. ...

MobileOcc: A Human-Aware Semantic Occupancy Dataset for Mobile Robots

Preprint (2025) - J. Kim, G. Dumont, X. Gao, Gang Cheng, Holger Caesar, J. Alonso-Mora

Dense 3D semantic occupancy perception is critical for mobile robots operating in pedestrian-rich environments, yet it remains underexplored compared to its application in autonomous driving. To address this gap, we present MobileOcc, a semantic occupancy dataset for mobile robots operating in crowded human environments. Our dataset is built using an annotation pipeline that incorporates static object occupancy annotations and a novel mesh optimization framework explicitly designed for human occupancy modeling. It reconstructs deformable human geometry from 2D images and subsequently refines and optimizes it using associated LiDAR point data. Using MobileOcc, we establish benchmarks for two tasks, i) Occupancy prediction and ii) Pedestrian velocity prediction, using different methods including monocular, stereo, and panoptic occupancy, with metrics and baseline implementations for reproducible comparison. Beyond occupancy prediction, we further assess our annotation method on 3D human pose estimation datasets. Results demonstrate that our method exhibits robust performance across different datasets. ...

VLPrompt-PSG

Vision-Language Prompting for Panoptic Scene Graph Generation

Journal article (2025) - Zijian Zhou, Holger Caesar, Qijun Chen, Miaojing Shi

Panoptic scene graph generation (PSG) aims at achieving a comprehensive image understanding by simultaneously segmenting objects and predicting relations among objects. However, the long-tail problem among relations leads to unsatisfactory results in real-world applications. Prior methods predominantly rely on vision information or utilize limited language information, such as object or relation names, thereby overlooking the utility of language information. Leveraging the recent progress in Large Language Models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the Vision-Language Prompting (VLPrompt) model, which acquires vision information from images and language information from LLMs. Then, through a prompter network based on an attention mechanism, it achieves precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, proving the effectiveness of incorporating language information and alleviating the long-tail problem of relations. Code is available at https://github.com/franciszzj/VLPrompt. ...

DPFT: Dual Perspective Fusion Transformer for Camera-Radar-Based Object Detection

Journal article (2025) - F. Fent, A. Palffy, H. Caesar

The perception of autonomous vehicles has to be efficient, robust, and cost-effective. However, cameras are not robust against severe weather conditions, lidar sensors are expensive, and the performance of radar-based perception is still inferior to the others. Camera-radar fusion methods have been proposed to address this issue, but these are constrained by the typical sparsity of radar point clouds and often designed for radars without elevation information. We propose a novel camera-radar fusion approach called Dual Perspective Fusion Transformer (DPFT), designed to overcome these limitations. Our method leverages lower-level radar data (the radar cube) instead of the processed point clouds to preserve as much information as possible and employs projections in both the camera and ground planes to effectively use radars with elevation information and simplify the fusion with camera data. As a result, DPFT has demonstrated state-of-the-art performance on the K-Radar dataset while showing remarkable robustness against adverse weather conditions and maintaining a low inference time. ...

Bosch Street Dataset: A Multi-Modal Dataset with Imaging Radar for Automated Driving

Preprint (2024) - Holger Caesar, Y. Lin, More Authors...

This paper introduces the Bosch street dataset (BSD), a novel multi-modal large-scale dataset aimed at promoting highly automated driving (HAD) and advanced driver-assistance systems (ADAS) research. Unlike existing datasets, BSD offers a unique integration of high-resolution imaging radar, lidar, and camera sensors, providing unprecedented 360-degree coverage to bridge the current gap in high-resolution radar data availability. Spanning urban, rural, and highway environments, BSD enables detailed exploration into radar-based object detection and sensor fusion techniques. The dataset is aimed at facilitating academic and research collaborations between Bosch and current and future partners. This aims to foster joint efforts in developing cutting-edge HAD and ADAS technologies. The paper describes the dataset's key attributes, including its scalability, radar resolution, and labeling methodology. Key offerings also include initial benchmarks for sensor modalities and a development kit tailored for extensive data analysis and performance evaluation, underscoring our commitment to contributing valuable resources to the HAD and ADAS research community. ...

Offline Tracking with Object Permanence

Conference paper (2024) - X. Liu, Holger Caesar

To reduce the expensive labor costs of manually labeling autonomous driving datasets, an alternative is to automatically label the datasets using an offline perception system. However, objects might be temporarily occluded. Such occlusion scenarios in the datasets are common yet underexplored in offline auto labeling. In this work, we propose an offline tracking model that focuses on occluded object tracks. It leverages the concept of object permanence, which means objects continue to exist even if they are not observed anymore. The model contains three parts: a standard online tracker, a re-identification (Re-ID) module that associates tracklets before and after occlusion, and a track completion module that completes the fragmented tracks. The Re-ID module and the track completion module use the vectorized lane map as a prior to refine the tracking results with occlusion. The model can effectively recover the occluded object trajectories. It significantly improves the original online tracking result, demonstrating its potential to be applied in offline auto labeling as a useful plugin to improve tracking by recovering occlusions. ...

UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness Against Missing Sensor Modalities

Conference paper (2024) - Shiming Wang, Holger Caesar, Liangliang Nan, J.F.P. Kooij

Multi-sensor object detection is an active research topic in automated driving, but the robustness of such detection models against missing sensor input (modality missing), e.g., due to a sudden sensor failure, is a critical problem which remains under-studied. In this work, we propose UniBEV, an end-to-end multi-modal 3D object detection framework designed for robustness against missing modalities: UniBEV can operate on LiDAR plus camera input, but also on LiDAR-only or camera-only input without retraining. To facilitate its detector head to handle different input combinations, UniBEV aims to create well-aligned Bird’s Eye View (BEV) feature maps from each available modality. Unlike prior BEV-based multi-modal detection methods, all sensor modalities follow a uniform approach to resample features from the original sensor coordinate systems to the BEV features. We furthermore investigate the robustness of various fusion strategies w.r.t. missing modalities: the commonly used feature concatenation, but also channel-wise averaging, and a generalization to weighted averaging termed Channel Normalized Weights. To validate its effectiveness, we compare UniBEV to state-of-the-art BEVFusion and MetaBEV on nuScenes over all sensor input combinations. In this setting, UniBEV achieves better performance than these baselines for all input combinations. An ablation study shows the robustness benefits of fusing by weighted averaging over regular concatenation, and of sharing queries between the BEV encoders of each modality. Our code is available at https://github.com/tudelft-iv/UniBEV. ...
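
To illustrate why averaging-based fusion tolerates a missing sensor, the sketch below fuses per-modality feature vectors by channel-wise weighted averaging. It is a simplified illustration rather than the UniBEV implementation: the softmax normalization over the available modalities, the dictionary layout, and all values are assumptions of this example. With equal weights it reduces to plain channel-wise averaging.

```python
import math


def fuse_weighted_average(features, weights):
    """Fuse per-modality feature vectors by channel-wise weighted averaging.

    features: dict modality -> list of channel values, or None if the sensor is missing.
    weights:  dict modality -> list of per-channel logits (hypothetical learned values).
    Weights are softmax-normalized over the modalities that are present,
    so the fused feature keeps the same scale when a sensor drops out.
    """
    present = [m for m, f in features.items() if f is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    num_channels = len(features[present[0]])
    fused = []
    for c in range(num_channels):
        exps = {m: math.exp(weights[m][c]) for m in present}
        total = sum(exps.values())
        fused.append(sum(exps[m] / total * features[m][c] for m in present))
    return fused


weights = {"lidar": [0.0, 0.0], "camera": [0.0, 0.0]}
both = fuse_weighted_average({"lidar": [1.0, 2.0], "camera": [3.0, 4.0]}, weights)
lidar_only = fuse_weighted_average({"lidar": [1.0, 2.0], "camera": None}, weights)
```

Unlike concatenation, the fused vector has the same number of channels for every input combination, so a downstream detector head sees a consistent input shape.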