Modular Neural Networks for Video Prediction

Master Thesis (2022)
Author(s)

N. Lin (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J.H.G. Dauwels – Mentor (TU Delft - Signal Processing Systems)

H. Jamali-Rad – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2022 Nan Lin
Publication Year
2022
Language
English
Graduation Date
26-08-2022
Awarding Institution
Delft University of Technology
Programme
Electrical Engineering
Related content

Project Github Repository

https://github.com/sentient-codebot/RIM-MovingMNIST
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Modular neural networks have recently received a surge of attention owing to their modular design and their potential to decompose complex dynamics and learn interactions among causal variables. Inspired by this potential, we employ the recently introduced Recurrent Independent Mechanisms (RIMs) in the downstream task of video prediction. RIMs consist of several modular recurrent units, each with its own modular hidden state, called RIM cells. These modules are connected by two attention mechanisms. Through experiments, we show that RIMs perform better than, or comparably to, related baselines.
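
To make the architecture described above concrete, the snippet below is a minimal, simplified PyTorch sketch of the RIMs idea only: the class name TinyRIMs and all hyperparameters (six cells, top-4 activation, hidden size 32) are illustrative assumptions, and details such as the null-input key used in the original input attention are omitted. It is not the implementation used in this thesis or in the linked repository.

```python
# Minimal sketch of RIMs: K independent GRU cells ("RIM cells") coupled by
# (1) input attention that activates only the top-k cells per step, and
# (2) communication attention that lets active cells read from all cells.
import torch
import torch.nn as nn


class TinyRIMs(nn.Module):
    def __init__(self, input_size=64, hidden_size=32, num_cells=6, top_k=4):
        super().__init__()
        self.num_cells, self.top_k, self.hidden_size = num_cells, top_k, hidden_size
        # One recurrent unit per cell (independent parameters = modularity).
        self.cells = nn.ModuleList(
            [nn.GRUCell(input_size, hidden_size) for _ in range(num_cells)]
        )
        # Input attention: each hidden state is scored against the input.
        self.q_in = nn.Linear(hidden_size, hidden_size)
        self.k_in = nn.Linear(input_size, hidden_size)
        # Communication attention among hidden states.
        self.q_c = nn.Linear(hidden_size, hidden_size)
        self.k_c = nn.Linear(hidden_size, hidden_size)
        self.v_c = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, h):
        # x: (batch, input_size), h: (batch, num_cells, hidden_size)
        # --- input attention: select the top-k most relevant cells ---
        scores = torch.einsum("bkd,bd->bk", self.q_in(h), self.k_in(x))
        active = torch.zeros_like(scores)
        active.scatter_(1, scores.topk(self.top_k, dim=-1).indices, 1.0)
        # --- independent recurrent update for every cell ---
        h_new = torch.stack(
            [cell(x, h[:, i]) for i, cell in enumerate(self.cells)], dim=1
        )
        # inactive cells keep their previous hidden state
        h_new = active.unsqueeze(-1) * h_new + (1 - active.unsqueeze(-1)) * h
        # --- communication attention between cells ---
        attn = torch.softmax(
            torch.einsum("bkd,bld->bkl", self.q_c(h_new), self.k_c(h_new))
            / self.hidden_size ** 0.5,
            dim=-1,
        )
        comm = torch.einsum("bkl,bld->bkd", attn, self.v_c(h_new))
        # only active cells incorporate the communicated information
        return h_new + active.unsqueeze(-1) * comm


# usage: one time step on random data
layer = TinyRIMs()
h = layer(torch.randn(8, 64), torch.zeros(8, 6, 32))
```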

From modular recurrent units to modular image representations, we push modularity further to explore how much performance can benefit from it. We extend the RIMs architecture on both the encoder and decoder sides to allow for object-centric (OC) feature representation learning in video prediction, resulting in an end-to-end architecture we refer to as OC-RIMs. Our qualitative evaluations demonstrate that each RIM cell in OC-RIMs attends to a specific object in the input scene at any given moment. As a result, OC-RIMs offer a considerable quantitative improvement in video prediction over comparable baselines across two datasets.
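
One common way to realize an object-centric decoder of this kind is to decode each RIM cell's hidden state into its own image component plus an alpha mask and composite the components into a single predicted frame. The sketch below illustrates that pattern; the class name SlotDecoder, the MLP decoder, and the single-channel (MNIST-like) output are assumptions for illustration and not necessarily the exact OC-RIMs design.

```python
# Hedged sketch of a per-cell, alpha-compositing decoder.
import torch
import torch.nn as nn


class SlotDecoder(nn.Module):
    def __init__(self, hidden_size=32, img_size=64):
        super().__init__()
        self.img_size = img_size
        # maps one cell's hidden state to an image component plus alpha logits
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 2 * img_size * img_size),
        )

    def forward(self, h):
        # h: (batch, num_cells, hidden_size)
        b, k, _ = h.shape
        out = self.net(h).view(b, k, 2, self.img_size, self.img_size)
        comp, alpha_logits = out[:, :, :1], out[:, :, 1:]
        alpha = torch.softmax(alpha_logits, dim=1)   # normalize across cells
        return (alpha * comp).sum(dim=1)             # composite final frame


decoder = SlotDecoder()
frame = decoder(torch.zeros(8, 6, 32))               # -> (8, 1, 64, 64)
```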

We perform extensive ablation studies to validate the design choices behind every module of RIMs. We empirically show that most modules work as expected. However, sparse activation greatly degrades prediction performance, contradicting the claims of the paper in which RIMs were proposed. Moreover, although RIM cells are expected to operate near-independently, our experiments show that the communication mechanism leads to heavy co-adaptation between cells, i.e., individual RIM cells fail to make any reasonable predictions on their own. These issues raise concerns about the design of RIMs. Finally, we point out directions for future work to address these deficiencies.
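
The co-adaptation finding can be probed with a simple test: keep only one cell's hidden state, zero the rest, and measure how much the prediction degrades. The helper below is a hypothetical sketch of such a probe; single_cell_error and the predict callable are assumed names, not code from the thesis.

```python
# Sketch of an independence probe for per-cell hidden states.
import torch


def single_cell_error(predict, h, target):
    """predict: callable mapping (batch, num_cells, hidden) -> predicted frame.
    Returns the per-cell MSE obtained when only that cell's state is kept."""
    errors = []
    for i in range(h.shape[1]):
        mask = torch.zeros_like(h)
        mask[:, i] = 1.0                       # keep only cell i
        pred = predict(h * mask)
        errors.append(torch.mean((pred - target) ** 2).item())
    return errors


# e.g. errors = single_cell_error(decoder_fn, hidden_states, ground_truth_frame)
```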

Files

MSc_Thesis_Nan_final.pdf
(pdf | 8.36 MB)
License info not available