On the Generalization of Metric Relative Pose Estimation Models to Unseen Environments

Master Thesis (2025)
Author(s)

B. Jangley (TU Delft - Mechanical Engineering)

Contributor(s)

Julian F.P. Kooij – Mentor (TU Delft - Intelligent Vehicles)

Christian Pek – Graduation committee member (TU Delft - Robot Dynamics)

M. Zaffar – Graduation committee member (TU Delft - Intelligent Vehicles)

Faculty
Mechanical Engineering
Publication Year
2025
Language
English
Graduation Date
26-09-2025
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering | Vehicle Engineering | Cognitive Robotics
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Crowd-sourced imagery is increasingly important for urban mapping and visual localization. However, its reliability is limited by GPS inaccuracies and heterogeneous capture conditions, including device variability, viewpoint differences, illumination changes, and temporal shifts. In these settings, achieving metric-scale pose estimation remains a central challenge. Deep learning-based pose estimation models address this problem by learning to estimate the 6-DoF pose from geometric cues between image views, with metric supervision during training on large datasets. This encourages spatial consistency and supports generalization across diverse conditions. Recent learning-based architectures, often built on vision transformer encoders, approach the task through unified multi-task frameworks that jointly predict metric depth maps and 2D–2D correspondences, with relative pose estimated downstream. This thesis evaluates whether such frameworks predict accurate metric depth maps under domain shift. Experiments show that, even after scale correction through fine-tuning with metric supervision, depth predictions from multi-task relative pose estimation models fail to generalize reliably to out-of-domain environments. In contrast, monocular models, trained on significantly larger and more varied datasets, demonstrate strong zero-shot reliability for metric depth prediction. A hybrid pipeline is proposed that combines the geometric consistency of relative pose models with the stable metric cues of monocular models, enabling robust pose estimation in crowd-sourced outdoor environments.
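The abstract does not spell out how the hybrid pipeline merges the two sources of information. As an illustration only, and not necessarily the method used in the thesis, one common way to transfer metric scale is to align the depth map of the relative pose model to a monocular metric depth map with a robust median ratio, then apply the same factor to the estimated translation. The sketch below assumes that the pose model's depth and translation share one unknown scale; the function names `median_scale` and `rescale_translation` are hypothetical.

```python
import numpy as np


def median_scale(pred_depth, metric_depth, valid=None):
    """Return a scalar s such that s * pred_depth roughly matches metric_depth.

    Assumption: both depth maps are pixel-aligned and differ mainly by one
    global scale factor. The median ratio is used because it tolerates outliers.
    """
    pred = np.asarray(pred_depth, dtype=float)
    ref = np.asarray(metric_depth, dtype=float)
    # Keep only pixels where both depths are finite and strictly positive.
    mask = np.isfinite(pred) & np.isfinite(ref) & (pred > 0) & (ref > 0)
    if valid is not None:
        mask &= np.asarray(valid, dtype=bool)
    if not mask.any():
        raise ValueError("no valid pixels to estimate the scale from")
    return float(np.median(ref[mask] / pred[mask]))


def rescale_translation(translation, scale):
    """Scale a relative translation vector; the rotation is unaffected by scale."""
    return scale * np.asarray(translation, dtype=float)


# Toy example: the pose model's depth is half the metric depth at valid pixels.
pred = np.array([[1.0, 2.0], [4.0, 0.0]])    # 0.0 marks an invalid pixel
metric = np.array([[2.0, 4.0], [8.0, 5.0]])  # hypothetical monocular metric depth
s = median_scale(pred, metric)
t_metric = rescale_translation([0.5, 0.0, 1.0], s)
```

A single global factor is the simplest choice; it cannot correct depth errors that vary across the image, which is one reason to read this as a sketch of the idea and not as a full pipeline.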

Files

License info not available