End-to-end behavior cloning agent for an object handover task

Training and evaluating a robot to perform simulated human-to-robot object handovers without requiring hand-object segmentation

Master Thesis (2025)
Author(s)

Y. Watabe (TU Delft - Mechanical Engineering)

Contributor(s)

Y.B. Eisma – Mentor (TU Delft - Human-Robot Interaction)

Y.B. Eisma – Graduation committee member (TU Delft - Human-Robot Interaction)

D. Dodou – Graduation committee member (TU Delft - Medical Instruments & Bio-Inspired Technology)

R. Zhang – Graduation committee member (TU Delft - Human-Robot Interaction)

Faculty
Mechanical Engineering
Publication Year
2025
Language
English
Graduation Date
28-04-2025
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering | Vehicle Engineering | Cognitive Robotics
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Current visuomotor manipulators train their perception, planning, and action components jointly in an end-to-end framework to avoid hand-engineered components. Despite this, existing methods for human-to-robot object handover tasks still require a perception component that segments the hand from the object, which can introduce error propagation. This study therefore investigates the applicability of an end-to-end framework that eliminates the need for hand-object segmentation in a simulated human-to-robot object handover task using HandoverSim.

To this end, a behavior cloning agent converts camera input into an RGB-D voxel space and outputs discretized 6-DoF manipulation actions, allowing it to discover features for the handover task directly. This study introduces a framework that combines the behavior cloning agent with HandoverSim, which allows experimenting with various training configurations. These configurations consist of experiments with: 1) expert demonstration data; 2) optimal camera setup; 3) handover objects; and 4) voxel-based RGB augmentation techniques.
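To make the pipeline concrete, the sketch below illustrates the three ingredients named above: voxelizing an RGB-D frame, discretizing a 6-DoF pose, and a voxel-based RGB augmentation. This is a minimal illustration only; the grid size, workspace bounds, camera intrinsics, and all function names (voxelize_rgbd, discretize_action, augment_rgb_voxels) are assumptions for the sketch, not the thesis' actual implementation.

```python
import numpy as np

GRID = 64                # voxels per axis (assumed)
WORKSPACE = (-0.5, 0.5)  # cubic workspace bounds in metres (assumed)

def voxelize_rgbd(rgb, depth, fx, fy, cx, cy):
    """Back-project an RGB-D image with pinhole intrinsics and scatter
    the resulting coloured points into a GRID^3 RGB voxel volume."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    cols = rgb.reshape(-1, 3).astype(np.float32) / 255.0
    lo, hi = WORKSPACE
    idx = ((pts - lo) / (hi - lo) * GRID).astype(int)
    keep = np.all((idx >= 0) & (idx < GRID), axis=1)  # drop points outside the workspace
    vol = np.zeros((GRID, GRID, GRID, 3), dtype=np.float32)
    vol[idx[keep, 0], idx[keep, 1], idx[keep, 2]] = cols[keep]
    return vol

def discretize_action(translation, rotation_euler, n_rot_bins=72):
    """Map a continuous 6-DoF gripper pose to discrete bins:
    translation -> voxel indices, rotation -> Euler-angle bins."""
    lo, hi = WORKSPACE
    t_bins = np.clip(((translation - lo) / (hi - lo) * GRID).astype(int),
                     0, GRID - 1)
    r_bins = (((rotation_euler + np.pi) / (2 * np.pi)) * n_rot_bins
              ).astype(int) % n_rot_bins
    return t_bins, r_bins

def augment_rgb_voxels(vol, rng, jitter=0.1):
    """One assumed form of voxel-based RGB augmentation: jitter the
    colour channels of occupied voxels while leaving geometry intact."""
    occupied = vol.any(axis=-1, keepdims=True)
    noise = rng.uniform(-jitter, jitter, size=vol.shape).astype(np.float32)
    return np.clip(vol + noise * occupied, 0.0, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.integers(0, 256, (480, 640, 3), dtype=np.uint8)
    depth = rng.uniform(0.3, 1.0, (480, 640)).astype(np.float32)
    vol = augment_rgb_voxels(voxelize_rgbd(rgb, depth, 600, 600, 320, 240), rng)
    t, r = discretize_action(np.array([0.1, -0.2, 0.05]),
                             np.array([0.0, np.pi / 2, -np.pi / 4]))
    print(vol.shape, t, r)  # (64, 64, 64, 3), voxel indices, rotation bins
```

Discretizing the pose this way turns 6-DoF regression into classification over voxel indices and angle bins, which is one common design choice for voxel-based behavior cloning agents.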

The trained model is evaluated on its generalization to diverse handover conditions in the HandoverSim benchmark. The results demonstrate that the behavior cloning agent can learn features for the handover task without requiring a dedicated perception component: the model learns the grasp-object relation while minimizing contact with the hand. However, performance is limited by sparse training data and grasping accuracy.
