Current visuomotor manipulation methods train their perception, planning, and action components jointly in an end-to-end framework to avoid hand-engineering individual components. Existing methods for human-to-robot object handover, however, still require a perception component that segments the hand from the object, which can introduce error propagation. This study therefore investigates the applicability of an end-to-end framework that eliminates the need for hand-object segmentation in a simulated human-to-robot object handover task using HandoverSim.
To this end, a behavior cloning agent converts camera input into an RGB-D voxel representation and outputs discretized 6-DoF manipulation actions, learning features for the handover task directly from demonstration data. This study introduces a framework that combines the behavior cloning agent with HandoverSim and allows experimenting with various training configurations. These configurations cover: 1) expert demonstration data; 2) camera setup; 3) handover objects; and 4) voxel-based RGB augmentation techniques.
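A minimal sketch of the kind of input and action processing such an agent might use is given below. It back-projects an RGB-D frame into a fixed RGB voxel grid and discretizes a continuous 6-DoF gripper pose into classification targets. The function names, grid resolution, rotation binning, and workspace parameters are illustrative assumptions and do not correspond to the exact implementation used in this study.

```
import numpy as np

def rgbd_to_voxel_grid(rgb, depth, intrinsics, workspace_min, workspace_max, grid_size=100):
    """Back-project an RGB-D frame into a fixed voxel grid with per-voxel RGB + occupancy.
    Illustrative sketch; parameters and resolution are assumptions, not the study's values."""
    h, w = depth.shape
    fx, fy, cx, cy = intrinsics
    # Pixel grid -> metric camera coordinates.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3).astype(np.float32) / 255.0

    # Keep only points inside the workspace bounds.
    mask = np.all((points >= workspace_min) & (points < workspace_max), axis=1)
    points, colors = points[mask], colors[mask]

    # Map metric coordinates to voxel indices.
    scale = grid_size / (workspace_max - workspace_min)
    idx = np.clip(((points - workspace_min) * scale).astype(int), 0, grid_size - 1)

    # Write RGB features and an occupancy channel (last write wins, for brevity).
    grid = np.zeros((grid_size, grid_size, grid_size, 4), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2], :3] = colors
    grid[idx[:, 0], idx[:, 1], idx[:, 2], 3] = 1.0
    return grid

def discretize_action(position, euler_deg, workspace_min, workspace_max,
                      grid_size=100, rot_bins=72):
    """Turn a continuous 6-DoF pose into behavior cloning targets:
    a translation voxel index plus one rotation bin per Euler axis."""
    scale = grid_size / (workspace_max - workspace_min)
    trans_idx = np.clip(((position - workspace_min) * scale).astype(int), 0, grid_size - 1)
    rot_idx = (np.mod(euler_deg, 360.0) / (360.0 / rot_bins)).astype(int)
    return trans_idx, rot_idx
```

Framing the output as a voxel index and rotation bins turns action prediction into a classification problem over the discretized workspace, which is what allows the agent to be trained with standard supervised behavior cloning losses on expert demonstrations.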
The trained model is evaluated on its generalization to diverse handover conditions in the HandoverSim benchmark. The results demonstrate that the behavior cloning agent can learn features for the handover task without a dedicated perception component: the model learns the grasp-object relation whilst minimizing contact with the human hand. However, performance remains limited by sparse training data and grasping accuracy.