MoSIS: End-to-End Self-Supervised Instance Segmentation from Video
N.R. Dubbeldam (TU Delft - Mechanical Engineering)
T. Lentsch – Mentor (TU Delft - Intelligent Vehicles)
D. Gavrila – Mentor (TU Delft - Intelligent Vehicles)
Holger Caesar – Graduation committee member (TU Delft - Intelligent Vehicles)
Abstract
We present MoSIS (Motion-Supervised Instance Segmentation), a self-supervised framework that learns instance masks from unlabelled video. Our method uses motion in video to derive pseudo-ground-truth masks, which are then used to train a YOLO model to segment these instances, making the pipeline end-to-end with minimal assumptions. Unlike methods that require motion cues at inference, MoSIS performs single-image instance segmentation, so it can detect objects that are momentarily static (e.g., vehicles waiting at a red light). To evaluate our approach systematically, we developed a controllable synthetic dataset with ground-truth masks and motion fields. Overall, MoSIS demonstrates that an instance-segmentation model can be trained from unlabelled video while requiring only a single RGB frame at inference. Although supervised training still attains a higher bounding-box mAP, our label-free approach produces usable instance masks and points to a practical route for reducing annotation cost in perception systems.
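The abstract summarises the pipeline (motion-derived pseudo-masks, then YOLO training, then single-frame inference) without implementation detail. As an illustration only, the sketch below shows one plausible form such a pipeline could take, assuming OpenCV's Farneback optical flow and the Ultralytics YOLO API; the `motion_pseudo_masks` helper, the thresholds, and the dataset name `mosis_pseudo.yaml` are hypothetical placeholders, not the authors' actual method.

```python
# Illustrative sketch only: MoSIS's actual motion-segmentation procedure and
# training setup are not specified in the abstract. Thresholds, file names,
# and the choice of Farneback flow are hypothetical placeholders.
import cv2
import numpy as np

def motion_pseudo_masks(prev_frame, next_frame, mag_thresh=2.0, min_area=500):
    """Derive per-instance pseudo-masks from dense optical flow between two frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Dense Farneback flow: one (dx, dy) displacement vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    moving = (mag > mag_thresh).astype(np.uint8)  # binary "is moving" map
    # Split the motion map into connected components, one pseudo-instance each.
    num_labels, labels = cv2.connectedComponents(moving)
    masks = []
    for label in range(1, num_labels):  # label 0 is the static background
        mask = labels == label
        if mask.sum() >= min_area:      # discard tiny flow-noise blobs
            masks.append(mask)
    return masks

# After exporting such masks as a YOLO-format segmentation dataset
# (hypothetical "mosis_pseudo.yaml"), a segmentation model could be trained
# and then run on single RGB frames, with no motion input at inference:
#
#   from ultralytics import YOLO
#   model = YOLO("yolov8n-seg.yaml")  # build from config, no pretrained labels
#   model.train(data="mosis_pseudo.yaml", epochs=100)
#   results = model("frame_with_static_cars.jpg")
```

Because the motion supervision is consumed only at training time in a pipeline like this, the resulting model segments instances from a single frame, including objects that happen to be static at test time, which matches the behaviour the abstract claims for MoSIS.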