We introduce LoopGNN, a GNN that estimates loop closure consensus by leveraging sets of similar keyframes retrieved through visual place recognition (VPR). This increases the robustness of predicted loop closures and yields a considerable efficiency boost over classical VPR + RANSAC geometric verification pipelines.

Visual loop closure detection traditionally relies on place recognition methods to retrieve candidate loops that are validated using computationally expensive RANSAC-based geometric verification. As false positive loop closures significantly degrade downstream pose graph estimates, the number of candidates that can be verified in online simultaneous localization and mapping (SLAM) is constrained by limited time and compute resources. While most deep loop closure detection approaches operate only on pairs of keyframes, we relax this constraint by considering neighborhoods of multiple keyframes when detecting loops. In this work, we introduce LoopGNN, a graph neural network architecture that estimates loop closure consensus by leveraging cliques of visually similar keyframes retrieved through place recognition. By propagating deep feature encodings among nodes of the clique, our method yields high-precision estimates while maintaining high recall. Extensive experimental evaluations on the TartanDrive 2.0 and NCLT datasets demonstrate that LoopGNN outperforms traditional baselines. Additionally, an ablation study across various keypoint extractors demonstrates that our method is robust regardless of the type of deep feature encodings used, and exhibits higher computational efficiency compared to classical geometric verification baselines.
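The RANSAC-based geometric verification mentioned above repeatedly samples minimal sets of keypoint correspondences, fits a model, and counts inliers; its cost scales with the number of candidate pairs, which is precisely what LoopGNN reduces. As a rough, self-contained illustration of why this is expensive, here is a toy RANSAC loop fitting a 2D translation between matched keypoints (a real verifier would fit an essential or fundamental matrix instead; all names here are illustrative):

```python
import random

def ransac_translation(matches, iters=200, inlier_thresh=2.0, seed=0):
    """Toy RANSAC: fit a 2D translation between matched keypoints.

    matches: list of ((x1, y1), (x2, y2)) correspondence pairs.
    Returns (best_translation, inlier_count).
    """
    rng = random.Random(seed)
    best_t, best_inliers = None, 0
    for _ in range(iters):
        # Minimal sample for a translation model: a single correspondence.
        (x1, y1), (x2, y2) = rng.choice(matches)
        tx, ty = x2 - x1, y2 - y1
        # Count how many correspondences agree with the hypothesis.
        inliers = sum(
            1 for (a, b), (c, d) in matches
            if abs(c - (a + tx)) < inlier_thresh and abs(d - (b + ty)) < inlier_thresh
        )
        if inliers > best_inliers:
            best_t, best_inliers = (tx, ty), inliers
    return best_t, best_inliers

# 8 true matches shifted by (5, 3), plus 2 gross outliers.
good = [((x, y), (x + 5, y + 3))
        for x, y in [(0, 0), (1, 2), (3, 1), (4, 4), (2, 5), (6, 0), (7, 3), (5, 5)]]
bad = [((0, 0), (40, 40)), ((1, 1), (-30, 7))]
t, n = ransac_translation(good + bad)
```

Even this toy loop performs hundreds of model fits and inlier counts per candidate pair; with thousands of keypoints per frame and a full epipolar model, verifying every VPR candidate quickly becomes the bottleneck.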

Approach

Overview of our approach
LoopGNN approach: We create keyframes from robot trajectories and utilize a deep keypoint extractor such as XFeat [25] to obtain keypoints for each image. Next, we fit a VLAD-based place recognition model, enabling fast and robust retrieval of similar frames given a query frame (left). Then, given a query frame, we independently encode the keypoint descriptors of all frames (the query and the retrieved ones) using a NetVLAD layer and construct a neighborhood graph. We feed this attributed graph into a graph attention network to produce a deep consensus regarding loop closures among keyframes of the neighborhood (middle). Finally, we extract the set of highest-scoring edge-wise predictions of the network and validate these pairs of frames using RANSAC-based geometric verification (right).
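As a rough sketch of the middle stage, the snippet below builds a neighborhood graph over a query and its VPR retrievals, runs one hand-rolled attention-style message-passing step over toy descriptor vectors, and scores each edge by descriptor similarity. This is only a schematic under simplified assumptions: the actual model operates on NetVLAD-encoded keypoint descriptors with a learned graph attention network, and every name here is illustrative.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_neighborhood_graph(query_id, retrieved_ids):
    """Fully connect the query and its VPR retrievals (a clique)."""
    nodes = [query_id] + list(retrieved_ids)
    edges = list(combinations(nodes, 2))
    return nodes, edges

def attention_step(desc, nodes, edges):
    """One softmax-attention aggregation per node (untrained, illustrative)."""
    nbrs = {n: [] for n in nodes}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    new_desc = {}
    for n in nodes:
        # Attention weights: softmax over similarity to each neighbor.
        scores = [cosine(desc[n], desc[m]) for m in nbrs[n]]
        zmax = max(scores)
        w = [math.exp(s - zmax) for s in scores]
        tot = sum(w)
        agg = [0.0] * len(desc[n])
        for wi, m in zip(w, nbrs[n]):
            for i, x in enumerate(desc[m]):
                agg[i] += (wi / tot) * x
        # Residual update: mix own descriptor with attended neighbors.
        new_desc[n] = [0.5 * a + 0.5 * b for a, b in zip(desc[n], agg)]
    return new_desc

def edge_consensus(desc, edges):
    """Score each edge by descriptor similarity after message passing."""
    return {e: cosine(desc[e[0]], desc[e[1]]) for e in edges}

# Toy descriptors: "a" resembles the query "q", "b" does not.
desc = {"q": [1.0, 0.0], "a": [0.9, 0.1], "b": [0.0, 1.0]}
nodes, edges = build_neighborhood_graph("q", ["a", "b"])
scores = edge_consensus(attention_step(desc, nodes, edges), edges)
```

After message passing, the edge between the query and the visually similar frame receives a higher consensus score than the edge to the dissimilar one; in LoopGNN, only the highest-scoring edges are passed on to geometric verification.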

Code

This work is released under the CC BY-NC-SA license. A software implementation of this project can be found on GitHub (Coming soon).

Authors

Martin Büchner, University of Freiburg

Liza Dahiya, Honda R&D

Simon Dorer, University of Freiburg

Vipul Ramtekkar, Honda R&D

Kenji Nishimiya, Honda R&D

Daniele Cattaneo, University of Freiburg

Abhinav Valada, University of Freiburg

Acknowledgment

This work was funded by Honda Research and Development and an academic grant from NVIDIA.