We present MoSIS (Motion-Supervised Instance Segmentation), a self-supervised framework that learns instance masks from unlabelled video. Our method uses motion in video to derive pseudo-masks, which then supervise the training of a YOLO segmentation model, making the pipeline end-to-end with minimal assumptions. Unlike methods that require motion cues at inference, MoSIS performs single-image instance segmentation, so it can detect objects that are momentarily static (e.g., vehicles waiting at a red light). To evaluate the approach systematically, we developed a controllable synthetic dataset with ground-truth masks and motion fields. Overall, MoSIS demonstrates that an instance-segmentation model can be trained from unlabelled video while requiring only a single RGB frame at inference. While supervised training still attains higher BBox mAP, our label-free approach produces usable instance masks and points to a practical route for reducing annotation cost in perception systems.
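To make the motion-supervision idea concrete, the following is a minimal sketch of one way the pseudo-labelling step could look: dense optical flow between consecutive frames is thresholded, moving pixels are grouped into instances, and each instance is written as a YOLO-segmentation polygon label. The flow estimator, thresholds, and file layout below are illustrative assumptions, not the exact MoSIS implementation.

```python
# Illustrative sketch: motion-derived pseudo-masks written as YOLO-seg labels.
import cv2
import numpy as np

FLOW_THRESH = 2.0   # assumed: pixels moving faster than this are "foreground"
MIN_AREA = 200      # assumed: drop tiny connected components (noise)

def motion_pseudo_labels(prev_bgr, next_bgr):
    """Return a list of normalized polygons, one per moving instance."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)

    # Dense optical flow (Farneback here; any flow estimator could be used).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)

    # Threshold motion magnitude and group moving pixels into instances.
    moving = (magnitude > FLOW_THRESH).astype(np.uint8)
    num_labels, labels = cv2.connectedComponents(moving)

    h, w = moving.shape
    polygons = []
    for inst in range(1, num_labels):          # label 0 is background
        comp = (labels == inst).astype(np.uint8)
        if comp.sum() < MIN_AREA:
            continue
        contours, _ = cv2.findContours(comp, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            continue
        contour = max(contours, key=cv2.contourArea).reshape(-1, 2)
        # Normalize coordinates to [0, 1] as YOLO segmentation labels expect.
        poly = (contour / np.array([w, h])).flatten().tolist()
        polygons.append(poly)
    return polygons

def write_yolo_seg_label(polygons, label_path, class_id=0):
    """Write one '<class> x1 y1 x2 y2 ...' line per pseudo-instance."""
    with open(label_path, "w") as f:
        for poly in polygons:
            coords = " ".join(f"{c:.6f}" for c in poly)
            f.write(f"{class_id} {coords}\n")
```

Label files produced this way could then be fed to an off-the-shelf YOLO segmentation trainer (for example, `YOLO("yolov8n-seg.pt").train(data="mosis.yaml", epochs=100)` with the ultralytics package, where `mosis.yaml` is a hypothetical dataset config); at inference the trained model needs only a single RGB frame.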