Learning state representations for robotic control

Information disentangling and multi-modal learning

Abstract

Representation learning is a central topic in deep learning. It aims to extract useful state representations directly from raw data. In deep learning, state representations are usually used for classification or inference; for example, image embeddings that provide a similarity metric can be used for face recognition. Recent successes in deep learning have stimulated interest in applying state representation learning to control. This problem is very different from deep learning for computer vision and language modelling. Control tasks usually require pose information about the task-relevant objects in the scene, so applying state representation learning to control requires such features to be embedded. This is a difficult problem since there is generally no explicit supervision for extracting pose information from the data.

One problem in state representation learning for control is that raw data can contain a large amount of information that is irrelevant to the control task. For example, the colors, textures and shapes of task-relevant objects are not as important as their poses. This suggests that appearance and pose information should be disentangled in the learned representations and that the pose representations should be used for the control task. Furthermore, we usually need a system dynamics model in order to perform model-based optimal control. The prediction ability of the model is crucial because planning has to evaluate future trajectories and optimize the actions. We therefore need to learn an accurate dynamics model on the extracted representations.
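To make the role of the prediction model concrete, the sketch below shows a simple random-shooting planner that rolls candidate action sequences through a learned dynamics model and picks the sequence whose predicted final state lies closest to a goal. This is a generic illustration rather than the planner used in this work; the `dynamics_model` interface, horizon, candidate count and action dimension are all assumptions.

```python
# Minimal sketch: evaluating future trajectories with a learned dynamics model.
# `dynamics_model(state, action) -> next_state` is a hypothetical callable.
import torch


def plan_action(dynamics_model, state, goal, horizon=5, n_candidates=256, action_dim=4):
    """Return the first action of the best-scoring random action sequence."""
    # Sample candidate action sequences: (n_candidates, horizon, action_dim).
    candidates = torch.rand(n_candidates, horizon, action_dim) * 2 - 1
    predicted = state.expand(n_candidates, -1)
    for t in range(horizon):
        # Roll every candidate sequence forward through the learned model.
        predicted = dynamics_model(predicted, candidates[:, t])
    # Score each sequence by how close its final predicted state is to the goal.
    costs = torch.norm(predicted - goal, dim=-1)
    return candidates[torch.argmin(costs), 0]
```

The quality of the chosen action depends entirely on how accurately the model predicts several steps ahead, which is why prediction performance is emphasised throughout.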

To address these problems, we first propose to use a Variational Auto-encoder (VAE) to disentangle pose and appearance representations. The benefit of preserving both representations is that we can predict pose representations conditioned on actions and, from the appearance and predicted pose representations, reconstruct and predict future image frames. For this purpose, we introduce a Long Short-Term Memory (LSTM) network to learn a prediction model on the pose representations. We further apply multi-modal state representation learning to control: leveraging the idea of sensor fusion, we want to verify whether multi-modal sensory data lead to better representations for control.
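The sketch below illustrates this kind of architecture: a VAE whose latent code is split into an appearance part and a pose part, and an LSTM that rolls the pose code forward conditioned on actions. The layer sizes, latent dimensions, 64x64 RGB input and the particular way the latent is split are illustrative assumptions, not the actual configuration used in this work.

```python
import torch
import torch.nn as nn


class DisentanglingVAE(nn.Module):
    """VAE with the latent code partitioned into appearance and pose parts."""

    def __init__(self, appearance_dim=16, pose_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(             # 64x64x3 image -> feature vector
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat, latent = 128 * 8 * 8, appearance_dim + pose_dim
        self.fc_mu = nn.Linear(feat, latent)       # mean of q(z|x)
        self.fc_logvar = nn.Linear(feat, latent)   # log-variance of q(z|x)
        self.appearance_dim = appearance_dim
        self.decoder = nn.Sequential(              # latent -> reconstructed image
            nn.Linear(latent, feat), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        z_appearance = z[:, :self.appearance_dim]  # static content of the scene
        z_pose = z[:, self.appearance_dim:]        # task-relevant object pose
        return self.decoder(z), mu, logvar, z_appearance, z_pose


class PosePredictor(nn.Module):
    """LSTM that predicts the next pose code, conditioned on the applied action."""

    def __init__(self, pose_dim=4, action_dim=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(pose_dim + action_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pose_dim)

    def forward(self, pose_seq, action_seq):
        # pose_seq: (batch, T, pose_dim), action_seq: (batch, T, action_dim)
        out, _ = self.lstm(torch.cat([pose_seq, action_seq], dim=-1))
        return self.head(out)                      # predicted pose code at t+1 for each t
```

Under this reading, future frames are obtained by concatenating the fixed appearance code with each predicted pose code and passing the result through the decoder, which mirrors the reconstruction-and-prediction step described above.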

To test our hypotheses, an object-poking simulation is prepared in the Gazebo simulator, where training and testing data sets are collected. We validate the disentangled representations and prediction performance against several baseline models. Experimental results show that information disentanglement can be achieved without explicit supervision, and that the prediction model can effectively predict future frames several steps ahead conditioned on poking actions. For multi-modal learning, we apply a Siamese network for inverse planning on the object-poking task. We test multi-modal representations with the Siamese network and report improvements in online simulations.
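One plausible reading of the inverse-planning setup is a weight-tied (Siamese) network that encodes the current and goal representations with the same branch and regresses the poking action expected to move the object toward the goal; the sketch below follows that reading. The observation dimension (standing in for a learned, possibly multi-modal representation), the action parameterisation and the training step are assumptions for illustration.

```python
import torch
import torch.nn as nn


class SiameseInverseModel(nn.Module):
    """Shared-weight branches over current and goal states, followed by an action head."""

    def __init__(self, obs_dim=20, action_dim=4, hidden=128):
        super().__init__()
        # The same branch (tied weights) is applied to both inputs; obs_dim could
        # hold a fused multi-modal representation (e.g. vision plus force/torque).
        self.branch = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.action_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, current_state, goal_state):
        h_cur = self.branch(current_state)   # same weights ...
        h_goal = self.branch(goal_state)     # ... applied to both inputs
        return self.action_head(torch.cat([h_cur, h_goal], dim=-1))


# Hypothetical training step on tuples from the poking data set, treating the
# observed next state as the "goal" and the recorded poke as the target action.
model = SiameseInverseModel()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
state_t, state_next, action_t = torch.randn(32, 20), torch.randn(32, 20), torch.randn(32, 4)
loss = nn.functional.mse_loss(model(state_t, state_next), action_t)
optimiser.zero_grad()
loss.backward()
optimiser.step()
```

At test time the goal branch would receive the desired object configuration, and the predicted action would be executed in the simulator; comparing single-modality and multi-modal inputs to the branches is the kind of comparison the online-simulation evaluation describes.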