Image matching is a fundamental computer vision problem.
While learning-based methods achieve state-of-the-art performance on
existing benchmarks, they generalize
poorly to in-the-wild images.
Such methods typically require training separate models for different
scene types (e.g., indoor vs. outdoor), making them impractical when
the scene type is unknown in advance.
One underlying problem is the limited scalability of existing
data-construction pipelines, which restricts the diversity of standard
image matching datasets.
To address this problem, we propose GIM, a self-training framework
for learning a single generalizable model
based on any image matching architecture using internet videos, an
abundant and diverse data source.
By not relying on complex 3D reconstruction, GIM is far more efficient
and less prone to failure than standard SfM-and-MVS-based frameworks.
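To make the reconstruction-free labeling idea concrete, below is a
minimal Python sketch of how pseudo-correspondences could be mined from
a pair of video frames without any SfM or MVS. It is a hedged
illustration, not GIM's actual pipeline: a classical ORB matcher and a
RANSAC fundamental-matrix check stand in for the learned matcher, and
the function name and parameters are purely illustrative.

```python
import cv2
import numpy as np

def mine_pseudo_matches(frame_a, frame_b, ransac_thresh=1.0):
    """Mine geometrically verified correspondences between two frames."""
    # Detect and describe keypoints. Classical ORB stands in here for
    # the learned matcher that a framework like GIM would actually use.
    orb = cv2.ORB_create(nfeatures=2000)
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return np.empty((0, 2)), np.empty((0, 2))

    # Mutual nearest-neighbor matching on binary descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    if len(matches) < 8:  # fundamental-matrix fitting needs >= 8 points
        return np.empty((0, 2)), np.empty((0, 2))
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])

    # Robustly fit a fundamental matrix; the RANSAC inlier mask discards
    # geometrically inconsistent matches, leaving pseudo-labels clean
    # enough to self-train on, with no 3D reconstruction required.
    _, mask = cv2.findFundamentalMat(pts_a, pts_b,
                                     cv2.FM_RANSAC, ransac_thresh)
    if mask is None:
        return np.empty((0, 2)), np.empty((0, 2))
    keep = mask.ravel().astype(bool)
    return pts_a[keep], pts_b[keep]
```

In a self-training setup of this kind, such verified matches would then
serve as supervision to fine-tune the matcher itself.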
Experiments demonstrate the effectiveness and generality of GIM.
Applying GIM consistently improves the zero-shot performance of three
state-of-the-art image matching architectures, and the gains grow as
the number of downloaded videos increases.