Data-efficient pick location learning from images for sequential tasks


Abstract

Learning from demonstration is a technique in which a robot learns directly from humans. Learning directly from humans is beneficial because they can easily demonstrate complex behaviors without requiring any special expertise. However, gathering large amounts of data from humans is challenging, because demonstrators tire, get bored, or lose focus over the course of many demonstrations. Learning algorithms therefore need to be as data-efficient as possible. The Transporter Network was introduced to solve sequential tasks from only a few human demonstrations.

The Transporter Network consists of two networks: a pick network and a place network. Both have a fully convolutional architecture, which is translationally equivariant. Translational equivariance is well suited to estimating pick locations because the predicted pick location moves together with the object's position in the image. However, the performance of the Transporter Network on sequential tasks depends on the receptive field of the fully convolutional network, because each pixel in the output should depend on every pixel in the input image. This dependence is what allows the network to make predictions based on the overall configuration of objects in the input image. A larger receptive field correlates more input pixels per prediction; however, enlarging the receptive field can reduce resolution, which in turn causes information loss in the latent space. It is therefore essential to choose a receptive field size that incurs minimal loss of resolution in the latent space of the fully convolutional network. This limitation was also observed in preliminary experiments on the pick network of the Transporter Network.
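The equivariance property can be checked directly. The following is a minimal sketch, not the thesis code: a toy untrained fully convolutional network (layer sizes are illustrative assumptions) whose output heatmap shifts exactly with the object in the input, so the argmax pick location tracks the object.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy fully convolutional "pick network": image in, pick heatmap out.
# Three 3x3 convs give a 7x7 receptive field; stacking more layers (or
# striding) would widen it at the cost of resolution, which is exactly
# the trade-off discussed above.
fcn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=3, padding=1, bias=False),
)

# Image containing a single "object" (a bright 4x4 block) away from the borders.
img = torch.zeros(1, 1, 64, 64)
img[..., 20:24, 20:24] = 1.0

shift = (7, -5)  # (rows, cols)
shifted_img = torch.roll(img, shifts=shift, dims=(-2, -1))

with torch.no_grad():
    out = fcn(img)
    out_shifted = fcn(shifted_img)

# Equivariance: fcn(shift(x)) == shift(fcn(x)), exactly, as long as the
# object's receptive-field footprint stays inside the image.
assert torch.allclose(out_shifted, torch.roll(out, shifts=shift, dims=(-2, -1)))

# Hence the argmax pick location moves with the object (heatmap magnitude
# is used here because the toy network is untrained).
def argmax_2d(heatmap):
    idx = heatmap.abs().flatten().argmax().item()
    return divmod(idx, heatmap.shape[-1])

print(argmax_2d(out[0, 0]), argmax_2d(out_shifted[0, 0]))
```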

In this work, a SEquential Attention Network (SEA Net) is introduced to remove the Transporter Network's dependence on receptive field size when solving sequential tasks. SEA Net is a variation of the Transporter Network's pick network and works under the assumption that the set of objects to be picked is known beforehand. Under this assumption, SEA Net learns to predict two things: 1) the pick locations of all the objects, and 2) which object to pick as a function of the current configuration of objects in the environment. SEA Net can be used on its own to predict pick locations sequentially, or as a drop-in replacement for the pick network in the Transporter Network.

SEA Net is evaluated on two datasets: a synthetic dataset and a simulated robot dataset. The synthetic dataset contains top-down views of simple shape blocks, whereas the simulated robot dataset is extracted by running a pre-trained policy in the MuJoCo simulator. The Object Keypoint Similarity (OKS) metric is used to score the distance between the predicted pick point and the ground truth, parameterized by a standard deviation threshold σ_threshold. A strict threshold of 2 pixels and a lenient threshold of 20 pixels are used: the strict threshold evaluates the resolution of the prediction, while the lenient threshold assesses whether the predicted pick point lies at least on the desired object. On the synthetic dataset, the model achieves an overall OKS accuracy of 64% at the strict threshold of 2 pixels and 85% at the lenient threshold of 20 pixels. On the simulated robot dataset, the model achieves an accuracy of 82% at the strict threshold of 2 pixels.
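The abstract does not give SEA Net's exact architecture; the following is a minimal sketch of such a two-headed pick network, in which the backbone, pooling, and head shapes are all illustrative assumptions. The point is the interface: one head predicts a pick-location heatmap per known object, the other scores which object to pick next from the global scene configuration.

```python
import torch
import torch.nn as nn

class SEANetSketch(nn.Module):
    """Two-headed pick network: per-object pick heatmaps plus a score for
    which object to pick next. All layer sizes here are assumptions."""

    def __init__(self, num_objects: int, channels: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(  # shared fully convolutional features
            nn.Conv2d(3, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.loc_head = nn.Conv2d(channels, num_objects, kernel_size=1)
        self.what_head = nn.Linear(channels, num_objects)

    def forward(self, x):
        feats = self.backbone(x)
        heatmaps = self.loc_head(feats)        # (B, K, H, W): pick location per object
        context = feats.mean(dim=(-2, -1))     # (B, C): global scene configuration
        what_logits = self.what_head(context)  # (B, K): which object to pick next
        return heatmaps, what_logits

net = SEANetSketch(num_objects=4)
heatmaps, what_logits = net(torch.zeros(1, 3, 64, 64))
k = what_logits.argmax(dim=-1).item()            # object chosen for this step
pick = heatmaps[0, k].flatten().argmax().item()  # its predicted pick pixel
```

Likewise, the exact OKS variant used in the thesis is not reproduced in the abstract; the sketch below assumes the common single-keypoint Gaussian form OKS = exp(−d² / (2σ²)), which shows why a 2-pixel threshold is strict and a 20-pixel threshold is lenient.

```python
import math

def oks(pred, gt, sigma_threshold):
    """Similarity in (0, 1]; 1 when the prediction is exactly on target.

    pred, gt        : (row, col) pixel coordinates
    sigma_threshold : standard deviation in pixels (2 strict, 20 lenient)
    """
    d2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma_threshold ** 2))

# A prediction 3 pixels off scores low under the strict threshold but
# high under the lenient one.
print(round(oks((50, 50), (53, 50), 2), 3))   # ~0.325
print(round(oks((50, 50), (53, 50), 20), 3))  # ~0.989
```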