MoSIS: End-to-End Self-Supervised Instance Segmentation from Video

Master Thesis (2025)
Author(s)

N.R. Dubbeldam (TU Delft - Mechanical Engineering)

Contributor(s)

T. Lentsch – Mentor (TU Delft - Intelligent Vehicles)

D. Gavrila – Mentor (TU Delft - Intelligent Vehicles)

H. Caesar – Graduation committee member (TU Delft - Intelligent Vehicles)

Faculty
Mechanical Engineering
Publication Year
2025
Language
English
Graduation Date
30-10-2025
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering | Vehicle Engineering | Cognitive Robotics
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

We present MoSIS (Motion-Supervised Instance Segmentation), a self-supervised framework that learns instance masks from unlabelled video. Our method uses motion cues in video to derive instance masks, which then supervise the training of a YOLO model to segment those instances, making the pipeline end-to-end with minimal assumptions. Unlike methods that require motion at inference, MoSIS performs single-image instance segmentation, so it can detect objects that are currently static (e.g., vehicles waiting at a red light). To evaluate our approach systematically, we developed a controllable synthetic dataset with ground-truth masks and motion fields. Overall, MoSIS shows that an instance-segmentation model can be trained from unlabelled video while requiring only a single RGB frame at inference. Although supervised training still attains higher BBox mAP, our label-free approach produces usable instance masks and points to a practical route for reducing annotation cost in perception systems.
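
As a rough illustration of the motion-supervision idea described above (not the thesis' actual pipeline), the Python sketch below derives instance pseudo-masks from dense optical flow between two consecutive frames. The function name motion_pseudo_masks, the choice of Farneback flow, the connected-components grouping, and all thresholds are illustrative assumptions, not details taken from the thesis.

# Minimal sketch, assuming motion masks come from thresholded dense optical flow.
import cv2
import numpy as np

def motion_pseudo_masks(frame_prev, frame_next, mag_thresh=1.0, min_area=500):
    """Derive instance pseudo-masks from motion between two BGR frames.

    Hypothetical illustration only; MoSIS' actual mask-generation step may differ.
    """
    g0 = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)
    # Dense optical flow (Farneback): a 2D displacement per pixel.
    flow = cv2.calcOpticalFlowFarneback(
        g0, g1, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # Pixels whose flow magnitude exceeds the threshold are treated as moving.
    mag = np.linalg.norm(flow, axis=2)
    moving = (mag > mag_thresh).astype(np.uint8)
    # Split the binary motion mask into per-instance masks.
    n_labels, labels = cv2.connectedComponents(moving)
    masks = []
    for i in range(1, n_labels):  # label 0 is background
        m = labels == i
        if m.sum() >= min_area:  # drop tiny flow blobs (likely noise)
            masks.append(m)
    return masks  # list of boolean HxW arrays, one per moving object

Pseudo-masks of this kind could then be converted into a standard segmentation-label format (for example, polygon labels for a YOLO segmentation model) so that the trained model needs only a single RGB frame at inference.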
