Zero-shot learning in pick-and-place tasks using neuro-symbolic concept learning

Abstract

Pick-and-place systems operating in warehouse settings have received considerable attention in recent years due to their high economic value for e-commerce companies. This thesis focuses on the perception pipeline that performs object recognition on a given input data stream (typically RGB-D images). Impressive object recognition results have been reported in recent years, mainly driven by the development of convolutional neural networks. However, few of the proposed perception pipelines can adapt quickly to recognize new objects. This is a problem, since large e-commerce companies add many new products to their inventory every day. This thesis addresses the problem by proposing a neuro-symbolic model in which concept learning is combined with symbolic reasoning. First, visual attributes are obtained from an input image by passing it through the neural part of the model, which consists of two β-Variational Autoencoders. The key property of the model is that visual attributes can be recognized even if a particular combination of attributes does not appear in the training dataset. Next, given a knowledge base and the visual attributes inferred from the input image, a symbolic reasoner infers the most likely object ID. The knowledge base is manually constructed and describes the relationships between object IDs and their visual attributes. The implementation of the neuro-symbolic model is first tested on a synthetic dataset similar to the one used in the study on which the neural part is based. Thereafter, the neuro-symbolic model is tested on real RGB-D images of the pick-and-place dataset, and several iterations of the baseline model are evaluated. The main research question is formulated as: can a neuro-symbolic model be used to recognize unseen objects given RGB-D data as typically encountered in pick-and-place scenarios? The best top-1 accuracy on unseen images of the synthetic dataset was 79.5%. However, with the same neuro-symbolic model on the pick-and-place dataset, the top-1 accuracy on unseen images was only 25.5%. Subsequent iterations of the model improved the top-1 accuracy to 32.4%. Analysis of the pick-and-place experiments shows that the neural part struggles to recognize the correct visual attributes. This is likely due, to some extent, to the simulation-to-real gap. However, further research is required to identify the exact cause(s) of the performance drop. In conclusion, the proposed neuro-symbolic model is capable of recognizing unseen images of the synthetic dataset, but performs poorly on unseen images of the pick-and-place dataset.
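
As a rough illustration of the reasoning step, the sketch below shows how a symbolic reasoner could combine per-attribute probabilities from the neural part with a manually constructed knowledge base to infer the most likely object ID. This is a minimal sketch under assumed names and a simple attribute-independence assumption, not the thesis implementation; the β-VAE attribute extraction is stubbed out.

```python
# Minimal sketch of the two-stage neuro-symbolic pipeline described above.
# All names (KNOWLEDGE_BASE, infer_object_id), the attribute layout, and the
# independence assumption between attributes are illustrative assumptions.

# Manually constructed knowledge base: object ID -> ground-truth visual
# attributes (one entry per product in the inventory).
KNOWLEDGE_BASE = {
    "obj_001": {"shape": "box",      "color": "red"},
    "obj_002": {"shape": "cylinder", "color": "blue"},
    "obj_003": {"shape": "box",      "color": "blue"},
}

def infer_object_id(attribute_probs):
    """Symbolic reasoner: score each object ID by the probability that the
    neural part assigned to its attributes and return the most likely ID."""
    scores = {}
    for obj_id, attributes in KNOWLEDGE_BASE.items():
        score = 1.0
        for name, value in attributes.items():
            # Product of per-attribute probabilities (attributes assumed
            # independent); unrecognized values contribute zero probability.
            score *= attribute_probs.get(name, {}).get(value, 0.0)
        scores[obj_id] = score
    return max(scores, key=scores.get)

# The two beta-VAEs (not shown) would map an RGB-D image to one probability
# distribution per visual attribute; here those outputs are stubbed.
attribute_probs = {
    "shape": {"box": 0.7, "cylinder": 0.3},
    "color": {"red": 0.1, "blue": 0.9},
}
print(infer_object_id(attribute_probs))  # -> obj_003
```

Note that, in this scheme, a new product only requires a new knowledge-base entry: as long as the neural part can recognize the individual attributes, the reasoner can identify an object whose attribute combination never appeared in the training data, which is what enables zero-shot recognition.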