Modular Neural Networks for Video Prediction

Master Thesis (2022)
Author(s)

N. Lin (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J.H.G. Dauwels – Mentor (TU Delft - Signal Processing Systems)

H. Jamali-Rad – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2022 Nan Lin
Publication Year
2022
Language
English
Graduation Date
26-08-2022
Awarding Institution
Delft University of Technology
Programme
Electrical Engineering
Related content

Project Github Repository

https://github.com/sentient-codebot/RIM-MovingMNIST
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Modular neural networks have recently received a surge of attention owing to their modular design and their potential to decompose complex dynamics and learn interactions among causal variables. Inspired by this potential, we employ the recently introduced Recurrent Independent Mechanisms (RIMs) in the downstream task of video prediction. RIMs consist of several modular recurrent units, each with its own modular hidden state, called RIM cells. These modules are connected by two attention mechanisms. Through experiments, we show that RIMs perform better than, or comparably to, related baselines.
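
To make the architecture described above concrete, the snippet below is a minimal, simplified PyTorch sketch of the RIMs idea only: the class name TinyRIMs and all hyperparameters (six cells, top-4 activation, hidden size 32) are illustrative assumptions, and details such as the null-input key used in the original input attention are omitted. It is not the implementation used in this thesis or in the linked repository.

```python
# Minimal sketch of RIMs: K independent GRU cells ("RIM cells") coupled by
# (1) input attention that activates only the top-k cells per step, and
# (2) communication attention that lets active cells read from all cells.
import torch
import torch.nn as nn


class TinyRIMs(nn.Module):
    def __init__(self, input_size=64, hidden_size=32, num_cells=6, top_k=4):
        super().__init__()
        self.num_cells, self.top_k, self.hidden_size = num_cells, top_k, hidden_size
        # One recurrent unit per cell (independent parameters = modularity).
        self.cells = nn.ModuleList(
            [nn.GRUCell(input_size, hidden_size) for _ in range(num_cells)]
        )
        # Input attention: each hidden state is scored against the input.
        self.q_in = nn.Linear(hidden_size, hidden_size)
        self.k_in = nn.Linear(input_size, hidden_size)
        # Communication attention among hidden states.
        self.q_c = nn.Linear(hidden_size, hidden_size)
        self.k_c = nn.Linear(hidden_size, hidden_size)
        self.v_c = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, h):
        # x: (batch, input_size), h: (batch, num_cells, hidden_size)
        # --- input attention: select the top-k most relevant cells ---
        scores = torch.einsum("bkd,bd->bk", self.q_in(h), self.k_in(x))
        active = torch.zeros_like(scores)
        active.scatter_(1, scores.topk(self.top_k, dim=-1).indices, 1.0)
        # --- independent recurrent update for every cell ---
        h_new = torch.stack(
            [cell(x, h[:, i]) for i, cell in enumerate(self.cells)], dim=1
        )
        # inactive cells keep their previous hidden state
        h_new = active.unsqueeze(-1) * h_new + (1 - active.unsqueeze(-1)) * h
        # --- communication attention between cells ---
        attn = torch.softmax(
            torch.einsum("bkd,bld->bkl", self.q_c(h_new), self.k_c(h_new))
            / self.hidden_size ** 0.5,
            dim=-1,
        )
        comm = torch.einsum("bkl,bld->bkd", attn, self.v_c(h_new))
        # only active cells incorporate the communicated information
        return h_new + active.unsqueeze(-1) * comm


# usage: one time step on random data
layer = TinyRIMs()
h = layer(torch.randn(8, 64), torch.zeros(8, 6, 32))
```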

From modular recurrent units to modular image representations, we push modularity further to explore how much performance can benefit from it. We extend the RIMs architecture on both the encoder and decoder sides to allow for object-centric (OC) feature representation learning in video prediction, resulting in an end-to-end architecture we refer to as OC-RIMs. Our qualitative evaluations demonstrate that each RIM cell in OC-RIMs attends to a specific object in the input scene at any given moment. As a result, OC-RIMs offer a considerable quantitative improvement in video prediction over comparable baselines across two datasets.
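
One common way to realize an object-centric decoder of this kind is to decode each RIM cell's hidden state into its own image component plus an alpha mask and composite the components into a single predicted frame. The sketch below illustrates that pattern; the class name SlotDecoder, the MLP decoder, and the single-channel (MNIST-like) output are assumptions for illustration and not necessarily the exact OC-RIMs design.

```python
# Hedged sketch of a per-cell, alpha-compositing decoder.
import torch
import torch.nn as nn


class SlotDecoder(nn.Module):
    def __init__(self, hidden_size=32, img_size=64):
        super().__init__()
        self.img_size = img_size
        # maps one cell's hidden state to an image component plus alpha logits
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 2 * img_size * img_size),
        )

    def forward(self, h):
        # h: (batch, num_cells, hidden_size)
        b, k, _ = h.shape
        out = self.net(h).view(b, k, 2, self.img_size, self.img_size)
        comp, alpha_logits = out[:, :, :1], out[:, :, 1:]
        alpha = torch.softmax(alpha_logits, dim=1)   # normalize across cells
        return (alpha * comp).sum(dim=1)             # composite final frame


decoder = SlotDecoder()
frame = decoder(torch.zeros(8, 6, 32))               # -> (8, 1, 64, 64)
```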

We perform extensive ablation studies to validate the design choices behind every module of RIMs. We empirically show that most modules work as expected. However, sparse activation greatly degrades prediction performance, contradicting the claims of the paper in which RIMs were proposed. Moreover, although RIM cells are expected to operate near-independently, our experiments show that the communication mechanism leads to heavy co-adaptation between cells, i.e., individual RIM cells fail to make any reasonable predictions on their own. These issues raise concerns about the design of RIMs. Finally, we point out directions for future work to address these deficiencies.
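
The co-adaptation finding can be probed with a simple test: keep only one cell's hidden state, zero the rest, and measure how much the prediction degrades. The helper below is a hypothetical sketch of such a probe; single_cell_error and the predict callable are assumed names, not code from the thesis.

```python
# Sketch of an independence probe for per-cell hidden states.
import torch


def single_cell_error(predict, h, target):
    """predict: callable mapping (batch, num_cells, hidden) -> predicted frame.
    Returns the per-cell MSE obtained when only that cell's state is kept."""
    errors = []
    for i in range(h.shape[1]):
        mask = torch.zeros_like(h)
        mask[:, i] = 1.0                       # keep only cell i
        pred = predict(h * mask)
        errors.append(torch.mean((pred - target) ** 2).item())
    return errors


# e.g. errors = single_cell_error(decoder_fn, hidden_states, ground_truth_frame)
```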

Files

MSc_Thesis_Nan_final.pdf
(pdf | 8.36 MB)
License info not available