Z. Xia | TU Delft Repository

Cross-View Camera Pose Estimation by Matching Local Features in 3D

Master thesis (2023) - S. Voloshyn, Z. Xia, J.F.P. Kooij, D. Gavrila

This work addresses visual localization of intelligent vehicles as an alternative to traditional GPS- or HD map-based localization options. Specifically, the problem of Cross-View Pose Estimation (CVPE) is explored, which involves estimating the vehicle pose within an encompassing aerial patch, given a ground image from the on-board camera feed. The aerial patch containing the ground truth pose can be obtained through a rough localization prior, such as GPS. We find that existing CVPE methods start with a location prior that is too coarse given both the GPS performance and the required localization error. Therefore, we define a fine-grained localization setting and propose three approaches, targeting performance, interpretability, and data efficiency. Furthermore, the approaches have a unique capacity to predict a 6-DoF camera pose. Two approaches involve matching point-level local features in 3D space using a novel point cross-attention, while the last one aims to tailor an existing dense feature matching method to the fine-grained setting. Although the quantitative performance of the local feature matching approaches is inferior to the state-of-the-art, we establish a new state-of-the-art on the fine-grained setting with the improved dense-feature baseline. Nevertheless, we show the key limitations of local feature matching, namely the influence of the “unmatchable” queries. Furthermore, using a 6-DoF projective transformation, we discover severe issues with the ground truth quality on the KITTI dataset, commonly used in CVPE literature, potentially accounting to a large degree for the substandard performance of most available CVPE methods. Finally, our local feature matching methods demonstrate the capability of predicting pitch and roll angles of the camera, whose estimation has not yet been attempted in CVPE. ...

Improving Cross-View Matching with Self-Supervised Learning

Master thesis (2023) - J. Cui, Z. Xia, J.F.P. Kooij, L. Nan

We explore the possibility of improving cross-view matching performance with self-supervised learning techniques and interpret the results in terms of the embedding space of image features. The effect of pre-training with contrastive learning is verified quantitatively through experiments and illustrated by visualizing the feature space. ...

SliceNet: Street-to-Satellite Image Metric Localization using Local Feature Matching

Master thesis (2022) - T. de Vries Lentsch, J.F.P. Kooij, Z. Xia, H.C. Caesar, S. Khademi

This work addresses visual localization for intelligent vehicles. The task of cross-view matching-based localization is to estimate the geo-location of a vehicle-mounted camera by matching the captured street view image with an overhead-view satellite map containing the vehicle's local surroundings. This local satellite view image can be obtained using any rough localization prior, e.g., from a global navigation satellite system or temporal filtering. Existing cross-view matching methods are global image descriptor-based and achieve considerably lower localization performance than structure-based methods with 3D maps. Whereas structure-based methods utilized global image descriptors in the past, recent structure-based work has shown that significantly better localization performance can be achieved using local image descriptors to find pixel-level correspondences between the query street view image and the 3D map. Hence, using local image descriptors may be the key to improving the localization performance of cross-view matching methods. However, the street and the satellite views not only exhibit very different visual appearances but also have distinct geometric configurations. As a result, finding correspondences between the two views is not a trivial task. We observe that the geometric relationship between the street and satellite view implies that every vertical line in the street view image has a corresponding azimuth direction in the satellite view image. Based on this prior, we devise a novel neural network architecture called SliceNet that extracts local image descriptors from both images and matches these to compute a dense spatial distribution for the camera's location. Specifically, the geometric prior is used as a weakly supervised signal to enable SliceNet to learn the correspondences between the two views. As an additional task, we also show that the extracted local image descriptors can be used to determine the heading of the camera.
SliceNet outperforms global image descriptor-based cross-view matching methods and achieves state-of-the-art localization results on the VIGOR dataset. Notably, the proposed method reduces the median metric localization error by 21% and 4% compared to the state-of-the-art methods when generalizing, respectively, in the same area and across areas.
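
The geometric prior mentioned in this abstract (each vertical line in the street view corresponds to an azimuth direction in the satellite view) can be illustrated with a small sketch. This is not SliceNet's implementation; it only shows the geometry under assumptions that are not stated in the abstract: the street view is a 360° equirectangular panorama whose center column faces the camera heading, and the satellite image is square and north-up with x to the right and y downward. The function names `column_to_azimuth` and `azimuth_ray_pixels` are hypothetical.

```python
import math


def column_to_azimuth(col, width, heading_deg=0.0):
    """Azimuth in degrees, clockwise from north, for a panorama column.

    Assumption: a 360-degree equirectangular panorama whose center
    column (col == width / 2) faces the direction heading_deg.
    """
    offset = col / width - 0.5  # fraction of a full turn, in [-0.5, 0.5)
    return (heading_deg + 360.0 * offset) % 360.0


def azimuth_ray_pixels(cam_xy, azimuth_deg, size, step=1.0):
    """Pixels crossed by the ray that leaves cam_xy in direction azimuth_deg.

    Assumption: a square, north-up satellite image of size x size pixels,
    with x pointing right (east) and y pointing down (south).
    """
    theta = math.radians(azimuth_deg)
    dx, dy = math.sin(theta), -math.cos(theta)  # north is -y in image space
    x, y = cam_xy
    pixels = []
    while 0 <= x < size and 0 <= y < size:
        p = (int(x), int(y))
        if not pixels or pixels[-1] != p:  # skip consecutive duplicates
            pixels.append(p)
        x += dx * step
        y += dy * step
    return pixels


# Usage: the satellite pixels that one panorama column could correspond to,
# for a hypothetical camera location at the center of a 10 x 10 patch.
azimuth = column_to_azimuth(col=180, width=360)
ray = azimuth_ray_pixels(cam_xy=(5.0, 5.0), azimuth_deg=azimuth, size=10)
```

Under these assumptions, each panorama column pairs with one such ray, which is the kind of column-to-direction relation the abstract describes as a weak supervision signal.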
