More Robust Visual Place Recognition with Image-to-Image Augmentations from Vision Foundation Models
F. Gebben (TU Delft - Mechanical Engineering)
J.F.P. Kooij – Mentor (TU Delft - Intelligent Vehicles)
M. Zaffar – Mentor (TU Delft - Intelligent Vehicles)
S. Khademi – Graduation committee member (TU Delft - Building Knowledge)
Abstract
Visual Place Recognition (VPR) remains a challenging problem, particularly under difficult conditions such as night-time or winter weather, which are often underrepresented in existing training datasets. Although transformer-based models have recently advanced the state of the art, their high computational demands can hinder deployment in real-world robotic systems. This thesis proposes a new data augmentation strategy for VPR that uses the image-to-image Vision Foundation Model (VFM) InstructPix2Pix to generate realistic visual variations, such as night and snow scenes, from the original training data. These synthetic augmentations are added to the original training set, extending its diversity without requiring additional data collection. To further improve performance, the method is combined with more advanced augmentations from the Kornia library, which on their own already improve robustness over traditional augmentation techniques. Experiments on multiple benchmark datasets show that lightweight, ResNet-based models trained with our VFM augmentations achieve significantly better performance under challenging visual conditions. Additional ablations demonstrate the importance of careful prompt design and hyperparameter tuning. Overall, this work shows that VFMs can serve as practical tools for targeted dataset augmentation, improving the robustness of existing VPR methods in difficult scenarios.
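The augmentation strategy described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis implementation: the prompt strings and the `edit_image`/`augment_dataset` names are assumptions, and `edit_image` stands in for an image-to-image model such as InstructPix2Pix (which in practice could be run via diffusers' `StableDiffusionInstructPix2PixPipeline`). The key point is that each synthetic image inherits the place label of its source image, so the retrieval supervision is unchanged while the training set grows.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative edit prompts; the thesis targets night and snow variations,
# but the exact wording used there is not given here.
EDIT_PROMPTS = ["make it night-time", "cover the scene in snow"]

@dataclass
class Sample:
    image: object   # e.g. a PIL.Image or tensor in a real pipeline
    place_id: int   # ground-truth place label used for retrieval supervision

def augment_dataset(
    samples: List[Sample],
    edit_image: Callable[[object, str], object],
    prompts: List[str] = EDIT_PROMPTS,
) -> List[Sample]:
    """Return the original samples plus one edited copy per prompt.

    `edit_image(image, prompt)` wraps the image-to-image VFM; each
    synthetic sample keeps the place_id of its source image.
    """
    augmented = list(samples)
    for s in samples:
        for p in prompts:
            augmented.append(Sample(image=edit_image(s.image, p),
                                    place_id=s.place_id))
    return augmented
```

With two prompts, the training set triples in size while the set of place labels stays fixed, which is what lets an existing VPR training recipe consume the synthetic data without modification.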