MoSIS: End-to-End Self-Supervised Instance Segmentation from Video
N.R. Dubbeldam (TU Delft - Mechanical Engineering)
T. Lentsch – Mentor (TU Delft - Intelligent Vehicles)
D. Gavrila – Mentor (TU Delft - Intelligent Vehicles)
Holger Caesar – Graduation committee member (TU Delft - Intelligent Vehicles)
Abstract
We present MoSIS (Motion-Supervised Instance Segmentation), a self-supervised framework that learns instance masks from unlabelled video. Our method uses motion in video to derive pseudo-ground-truth masks, which are then used to train a YOLO model to segment these instances, making the pipeline end-to-end with minimal assumptions. Unlike methods that require motion cues at inference, MoSIS performs single-image instance segmentation, so it can detect objects that are momentarily static (e.g., vehicles waiting at a red light). To evaluate our approach systematically, we developed a controllable synthetic dataset with ground-truth masks and motion fields. Overall, MoSIS demonstrates that an instance-segmentation model can be trained from unlabelled video while requiring only a single RGB frame at inference. Although supervised training still attains a higher bounding-box mAP, our label-free approach produces usable instance masks and points to a practical route for reducing annotation cost in perception systems.
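The abstract summarises the pipeline (motion-derived pseudo-masks, then YOLO training, then single-frame inference) without implementation detail. As an illustration only, the sketch below shows one plausible form such a pipeline could take, assuming OpenCV's Farneback optical flow and the Ultralytics YOLO API; the `motion_pseudo_masks` helper, the thresholds, and the dataset name `mosis_pseudo.yaml` are hypothetical placeholders, not the authors' actual method.

```python
# Illustrative sketch only: MoSIS's actual motion-segmentation procedure and
# training setup are not specified in the abstract. Thresholds, file names,
# and the choice of Farneback flow are hypothetical placeholders.
import cv2
import numpy as np

def motion_pseudo_masks(prev_frame, next_frame, mag_thresh=2.0, min_area=500):
    """Derive per-instance pseudo-masks from dense optical flow between two frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Dense Farneback flow: one (dx, dy) displacement vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    moving = (mag > mag_thresh).astype(np.uint8)  # binary "is moving" map
    # Split the motion map into connected components, one pseudo-instance each.
    num_labels, labels = cv2.connectedComponents(moving)
    masks = []
    for label in range(1, num_labels):  # label 0 is the static background
        mask = labels == label
        if mask.sum() >= min_area:      # discard tiny flow-noise blobs
            masks.append(mask)
    return masks

# After exporting such masks as a YOLO-format segmentation dataset
# (hypothetical "mosis_pseudo.yaml"), a segmentation model could be trained
# and then run on single RGB frames, with no motion input at inference:
#
#   from ultralytics import YOLO
#   model = YOLO("yolov8n-seg.yaml")  # build from config, no pretrained labels
#   model.train(data="mosis_pseudo.yaml", epochs=100)
#   results = model("frame_with_static_cars.jpg")
```

Because the motion supervision is consumed only at training time in a pipeline like this, the resulting model segments instances from a single frame, including objects that happen to be static at test time, which matches the behaviour the abstract claims for MoSIS.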