GIM: Learning Generalizable Image Matcher From Internet Videos

🌟 ICLR 2024 Spotlight (top-5%)


Xuelun Shen 1, Zhipeng Cai 📧 2, Wei Yin 3, Matthias Müller 2, Zijun Li 1, Kaixuan Wang 3, Xiaozhi Chen 3, Cheng Wang 📧 1

📧 Corresponding Author

1 Xiamen University, 2 Intel, 3 DJI

Overview video (9 min)


Abstract

Image matching is a fundamental computer vision problem. While learning-based methods achieve state-of-the-art performance on existing benchmarks, they generalize poorly to in-the-wild images. Such methods typically need to train separate models for different scene types (e.g., indoor vs. outdoor) and are impractical when the scene type is unknown in advance. One of the underlying problems is the limited scalability of existing data construction pipelines, which limits the diversity of standard image matching datasets. To address this problem, we propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture using internet videos, an abundant and diverse data source. Not relying on complex 3D reconstruction makes GIM much more efficient and less likely to fail than standard SfM-and-MVS based frameworks.

Experiments demonstrate the effectiveness and generality of GIM: applying it consistently improves the zero-shot performance of three state-of-the-art image matching architectures, with performance growing as the number of downloaded videos increases.


Two-view Matching (no RANSAC) and Reconstruction

[Interactive galleries: five example scenes, each comparing raw matches (no RANSAC) and reconstructions from SuperGlue, LoFTR, DKM, and GIMDKM on a pair of input images.]
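
For reference, raw correspondences like those shown above can be visualized with a few lines of OpenCV. Below is a minimal sketch, where `kpts0`/`kpts1` are the (N, 2) pixel-coordinate arrays returned by whichever matcher is being compared; the matcher call itself is omitted, since each method has its own API.

```python
import cv2
import numpy as np

def draw_matches(path0, path1, kpts0, kpts1, out_path="matches.png"):
    """Draw raw (unfiltered) correspondences between two images side by side."""
    img0, img1 = cv2.imread(path0), cv2.imread(path1)
    h = max(img0.shape[0], img1.shape[0])
    canvas = np.zeros((h, img0.shape[1] + img1.shape[1], 3), np.uint8)
    canvas[:img0.shape[0], :img0.shape[1]] = img0
    canvas[:img1.shape[0], img0.shape[1]:] = img1
    # Shift the second image's keypoints right by the width of the first.
    kpts1 = kpts1 + np.array([img0.shape[1], 0])
    for (x0, y0), (x1, y1) in zip(kpts0, kpts1):
        cv2.line(canvas, (int(x0), int(y0)), (int(x1), int(y1)), (0, 255, 0), 1)
    cv2.imwrite(out_path, canvas)
```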


Multi-view Reconstruction

We replaced SIFT matching in COLMAP with each of the two learned matchers (DKM and GIMDKM) and then ran SfM and MVS.

[Gallery: six sampled input images and the resulting reconstructions from COLMAP (SIFT), COLMAP + DKM, and COLMAP + GIMDKM.]
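
Concretely, a learned matcher's correspondences can be injected into COLMAP by writing them into its SQLite database before running the mapper. Below is a minimal sketch assuming COLMAP's bundled helper (`scripts/python/database.py`); `match_pair`, the intrinsics, and the pair list are placeholders, not part of GIM's released code.

```python
import numpy as np
from database import COLMAPDatabase  # ships with COLMAP: scripts/python/database.py

def write_matches(db_path, image_names, image_pairs, match_pair,
                  width, height, focal):
    """Store pairwise matches from a learned matcher in a COLMAP database.

    `match_pair(name0, name1)` is a placeholder for the matcher (e.g. DKM or
    GIMDKM) returning two (N, 2) arrays of matched pixel coordinates.
    Assumes every image appears in at least one pair.
    """
    db = COLMAPDatabase.connect(db_path)
    db.create_tables()
    # Model 0 = SIMPLE_PINHOLE with params (f, cx, cy); one shared camera
    # for simplicity -- use per-image intrinsics if you have them.
    cam = db.add_camera(0, width, height,
                        np.array([focal, width / 2.0, height / 2.0]))
    ids = {n: db.add_image(n, cam) for n in image_names}

    kpts = {n: [] for n in image_names}   # keypoints accumulated per image
    counts = {n: 0 for n in image_names}  # running keypoint count per image
    for n0, n1 in image_pairs:
        p0, p1 = match_pair(n0, n1)
        # Dense matchers have no global keypoint list, so append this pair's
        # points to each image's list and link them by index.
        idx0 = counts[n0] + np.arange(len(p0))
        idx1 = counts[n1] + np.arange(len(p1))
        kpts[n0].append(p0); counts[n0] += len(p0)
        kpts[n1].append(p1); counts[n1] += len(p1)
        db.add_matches(ids[n0], ids[n1], np.stack([idx0, idx1], axis=1))

    for n in image_names:
        db.add_keypoints(ids[n], np.concatenate(kpts[n]).astype(np.float32))
    db.commit()
    db.close()
```

COLMAP only triangulates geometrically verified matches, so a verification pass over the stored pairs (for example via pycolmap's `verify_matches`) is still needed before running `colmap mapper`.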

Rendered Video from Gaussian Splatting

We used the multi-view reconstruction results to train Gaussian Splatting and then rendered the scene to video. Since GIM produces more accurate camera poses, the rendered results are of higher quality.
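
As a rough sketch of this rendering step (assuming the reference graphdeco-inria/gaussian-splatting implementation and its expected scene layout of `images/` plus `sparse/0/`; the paths below are placeholders):

```python
import subprocess

# Train 3DGS on the COLMAP reconstruction (-s: scene/source path, -m: where
# to write the trained model), then render the trained model's cameras.
subprocess.run(["python", "train.py", "-s", "scene", "-m", "output/scene"], check=True)
subprocess.run(["python", "render.py", "-m", "output/scene"], check=True)
```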

[Galleries: three scenes, each showing six sampled input images and videos rendered from the COLMAP (SIFT), COLMAP + DKM, and COLMAP + GIMDKM reconstructions; one method failed to reconstruct the third scene.]


Emergent Ability

We also ran matching on images of types never encountered during training. Even though their styles and modalities differ completely from the training data, the zero-shot generalization ability of GIM still handles them.

BEV images

[Match and warp results for a pair of BEV images.]

Remote sensing and BEV images

[Match and warp results for a pair of remote sensing and LiDAR BEV images.]

Remote sensing and aerial images

[Match and warp results for a pair of remote sensing and aerial images.]
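
The warp visualizations above can be reproduced from raw matches by fitting a homography. Below is a minimal sketch using OpenCV, where `kpts_src`/`kpts_dst` are the matched (N, 2) pixel coordinates; a homography is a reasonable model here because BEV and remote-sensing views are roughly planar.

```python
import cv2
import numpy as np

def warp_overlay(img_src, img_dst, kpts_src, kpts_dst):
    """Warp img_src onto img_dst via a RANSAC-fitted homography and blend."""
    H, inlier_mask = cv2.findHomography(
        kpts_src.astype(np.float64), kpts_dst.astype(np.float64),
        cv2.RANSAC, 5.0)  # 5 px reprojection threshold
    h, w = img_dst.shape[:2]
    warped = cv2.warpPerspective(img_src, H, (w, h))
    # A 50/50 blend makes any misalignment easy to spot.
    return cv2.addWeighted(img_dst, 0.5, warped, 0.5, 0.0)
```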

Poster

@inproceedings{xuelun2024gim,
  title     = {GIM: Learning Generalizable Image Matcher From Internet Videos},
  author    = {Xuelun Shen and Zhipeng Cai and Wei Yin and Matthias Müller and Zijun Li and Kaixuan Wang and Xiaozhi Chen and Cheng Wang},
  booktitle = {The Twelfth International Conference on Learning Representations},
  year      = {2024}
}