Modular Neural Networks for Video Prediction

Abstract

Modular neural networks have recently received a surge of attention owing to their modular design and their potential to decompose complex dynamics and learn interactions among causal variables. Inspired by this potential, we employ the recently introduced Recurrent Independent Mechanisms (RIMs) in the downstream task of video prediction. RIMs consist of several modular recurrent units, each with its own hidden state, called RIM cells. The cells are connected through two attention mechanisms: an input attention that determines which cells activate at each step, and a communication attention through which active cells exchange information. Through experiments, we show that RIMs perform better than or comparably to related baselines.
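To make the two attention mechanisms concrete, the following is a minimal sketch of a single RIMs update step, written in PyTorch. It is an illustration of the general mechanism, not the thesis implementation; all sizes, the choice of GRU cells, and the simple top-k activation rule are assumptions.

```python
# Minimal single-step RIMs sketch: per-cell input attention over (input, null),
# independent GRU updates for the active cells, and communication attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RIMsStep(nn.Module):
    def __init__(self, num_cells=6, top_k=4, in_dim=64, hid_dim=100):
        super().__init__()
        self.num_cells, self.top_k = num_cells, top_k
        # Input attention: each hidden state queries the (input, null) pair.
        self.q_in = nn.Linear(hid_dim, 32)
        self.k_in = nn.Linear(in_dim, 32)
        self.v_in = nn.Linear(in_dim, in_dim)
        # One independent recurrent unit (here a GRUCell) per RIM cell.
        self.cells = nn.ModuleList(nn.GRUCell(in_dim, hid_dim) for _ in range(num_cells))
        # Communication attention among the cells' hidden states.
        self.q_c = nn.Linear(hid_dim, 32)
        self.k_c = nn.Linear(hid_dim, 32)
        self.v_c = nn.Linear(hid_dim, hid_dim)

    def forward(self, x, h):
        # x: (batch, in_dim), h: (batch, num_cells, hid_dim)
        null = torch.zeros_like(x)
        keys = self.k_in(torch.stack([x, null], dim=1))            # (B, 2, 32)
        vals = self.v_in(torch.stack([x, null], dim=1))            # (B, 2, in_dim)
        attn = F.softmax(self.q_in(h) @ keys.transpose(1, 2) / 32 ** 0.5, dim=-1)  # (B, C, 2)
        # Cells attending mostly to the null input stay inactive this step.
        active = attn[..., 0].topk(self.top_k, dim=-1).indices     # (B, top_k)
        read = attn @ vals                                         # (B, C, in_dim)

        # Independent per-cell updates, kept only for the active cells.
        h_new = torch.stack([cell(read[:, i], h[:, i])
                             for i, cell in enumerate(self.cells)], dim=1)
        mask = torch.zeros(h.shape[:2], device=h.device).scatter_(1, active, 1.0).unsqueeze(-1)
        h_upd = mask * h_new + (1 - mask) * h

        # Sparse communication: active cells read from all cells via attention.
        comm = F.softmax(self.q_c(h_upd) @ self.k_c(h_upd).transpose(1, 2) / 32 ** 0.5,
                         dim=-1) @ self.v_c(h_upd)
        return h_upd + mask * comm
```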

Moving from modular recurrent units to modular image representations, we push modularity further to explore how much performance can benefit from it. We extend the RIMs architecture on both the encoder and decoder sides to enable object-centric (OC) feature representation learning in video prediction, resulting in an end-to-end architecture we refer to as OC-RIMs. Our qualitative evaluations demonstrate that each RIM cell in OC-RIMs attends to a specific object in the input scene at any given moment. As a result, OC-RIMs achieve considerable quantitative improvements in video prediction over comparable baselines across two datasets.
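A rough sketch of this object-centric extension, under assumed shapes and module names, is shown below. The encoder emits one feature slot per RIM cell, each cell is updated from its own slot, and a shared decoder renders every cell separately before the renderings are summed into the predicted frame. For brevity, the full RIMs attention core is collapsed here to independent GRU cells; this is an illustrative sketch, not the thesis code.

```python
# Hypothetical object-centric wrapper around modular cells: per-cell slots from the
# encoder, per-cell recurrent updates, and per-cell decodings summed into one frame.
import torch
import torch.nn as nn

class OCRIMsSketch(nn.Module):
    def __init__(self, num_cells=6, feat_dim=64, hid_dim=100, img_ch=1):
        super().__init__()
        self.num_cells, self.hid_dim = num_cells, hid_dim
        # Convolutional encoder followed by a projection into per-cell object slots.
        self.encoder = nn.Sequential(
            nn.Conv2d(img_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 16 * 16, num_cells * feat_dim))
        # One recurrent unit per cell, standing in for the full RIMs core.
        self.cells = nn.ModuleList(nn.GRUCell(feat_dim, hid_dim) for _ in range(num_cells))
        # Shared decoder applied to every cell's hidden state.
        self.decoder = nn.Sequential(
            nn.Linear(hid_dim, 64 * 16 * 16), nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_ch, 4, stride=2, padding=1))

    def forward(self, frames):
        # frames: (batch, time, channels, 64, 64); returns one prediction per step.
        B, T, C, H, W = frames.shape
        h = frames.new_zeros(B, self.num_cells, self.hid_dim)
        preds = []
        for t in range(T):
            slots = self.encoder(frames[:, t]).view(B, self.num_cells, -1)
            # Each cell is updated from its own object slot, keeping the cells modular.
            h = torch.stack([cell(slots[:, i], h[:, i])
                             for i, cell in enumerate(self.cells)], dim=1)
            # Decode every cell separately and sum the per-object renderings.
            parts = self.decoder(h.reshape(B * self.num_cells, -1))
            preds.append(torch.sigmoid(parts.view(B, self.num_cells, C, H, W).sum(dim=1)))
        return torch.stack(preds, dim=1)
```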

We perform extensive ablation studies to validate the design choices of every module of RIMs. We empirically show that most modules work as expected. However, sparse activation substantially degrades prediction performance, contrary to the claims in the paper that proposed RIMs. Moreover, although RIM cells are expected to operate near-independently, our experiments show that the communication mechanism leads to heavy co-adaptation between cells, i.e., individual RIM cells fail to make any reasonable predictions on their own. These issues raise concerns about the design of RIMs. Finally, we point out directions for future work to address these deficiencies.