Learning Event Representations for Vision Foundation Models for Monocular Depth Estimation

Master Thesis (2026)

Author(s)

M. Jiang (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

H. Araghi – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

M. Weinmann – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Computer vision, Representation learning, Neuromorphic vision, Event camera, Vision Foundation Model, Monocular Depth Estimation

To reference this document use

https://resolver.tudelft.nl/uuid:94fee874-5cdd-491e-904a-e7a5311d1515

More Info

Publication Year

2026

Language

English

Graduation Date

25-06-2026

Awarding Institution

Delft University of Technology

Programme

Data Science and Artificial Intelligence Technology

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Event cameras are a novel sensing modality, but the lack of densely annotated datasets remains a major limitation for tasks such as monocular depth estimation. To address this, we investigate how Vision Foundation Models (VFMs), trained on large-scale RGB datasets, can be leveraged for event-based depth estimation. Previous work combines handcrafted event representations with fine-tuning of VFMs to adapt them to the event domain. In contrast, we learn an event representation while keeping the VFM frozen. We evaluate two representation learners, a U-Net and a Fully Convolutional (FullyConv) model, on DSEC and MVSEC. The results show that learned event representations are highly effective in-domain: both models outperform all baselines on DSEC, including Depth AnyEvent (DAE) and direct RGB input to Depth Anything V2 (DAv2), while the FullyConv model remains competitive on MVSEC. Cross-dataset experiments show that this improvement does not consistently transfer under domain shift. These findings indicate that learning the input representation is a strong strategy for in-domain event-based depth estimation, but that representation learning alone is not sufficient to guarantee robust cross-dataset generalization.
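
For readers unfamiliar with the term, a handcrafted event representation converts a sparse stream of events into a dense, image-like array that an RGB-trained model can consume. The following is a minimal sketch of one common choice, a per-polarity event count histogram. The event tuple layout, the function name `event_histogram`, and the sensor size are assumptions made for illustration; this is not necessarily the representation used in the thesis or by its baselines.

```python
from typing import Iterable, List, Tuple

# Hypothetical event format for this sketch: (x, y, timestamp, polarity),
# with polarity +1 (brightness increase) or -1 (brightness decrease).
Event = Tuple[int, int, float, int]


def event_histogram(
    events: Iterable[Event], width: int, height: int
) -> List[List[List[int]]]:
    """Accumulate events into a 2-channel count image of shape [2][height][width].

    Channel 0 counts positive events, channel 1 counts negative events.
    Events that fall outside the sensor bounds are ignored.
    """
    hist = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, _t, p in events:
        if 0 <= x < width and 0 <= y < height:
            channel = 0 if p > 0 else 1
            hist[channel][y][x] += 1
    return hist


# Toy event stream on an assumed 3x2 sensor; the last event is out of bounds.
events = [(0, 0, 0.001, 1), (0, 0, 0.002, 1), (2, 1, 0.003, -1), (5, 5, 0.004, 1)]
hist = event_histogram(events, width=3, height=2)
```

A fixed mapping like this discards timing within the window and cannot adapt to the downstream model. In the approach the abstract describes, it is replaced by a trainable network (a U-Net or FullyConv model) whose output is fed to the frozen VFM.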

Files

MScThesisV2.pdf

(pdf | 24.7 MB)