Exploring the Impact of PPO Cost Functions on Skill Mutation in Robotic Pouring Tasks
J. van Buuren (TU Delft - Mechanical Engineering)
L. Peternel – Mentor (TU Delft - Human-Robot Interaction)
Micah Prendergast – Graduation committee member (TU Delft - Human-Robot Interaction)
Abstract
This thesis explores how deliberate modifications to reward function design can induce skill mutations in robotic reinforcement learning, specifically within a precision pouring task. Using a simulated Franka Emika Panda robot in NVIDIA Isaac Lab, we evaluate 25 distinct reward configurations composed of weighted terms for effort, accuracy, and velocity. The resulting policies exhibit a wide range of behaviors, from fast and efficient pours to novel skills such as rim cleaning, mixing, and watering. Our analysis shows that even small changes in reward structure can lead to significant shifts in policy behavior, yielding both task-optimal and creative, potentially transferable strategies. We validate this concept with the Proximal Policy Optimization (PPO) algorithm, showing that reward design alone, without altering the learning architecture, can drive meaningful skill diversification. This approach offers promising directions for adaptive control, transfer learning, and multi-objective optimization in robotic systems.
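To make the idea of "weighted reward terms" concrete, the sketch below shows one way such a reward could be composed in Python. It is an illustrative assumption, not the thesis's actual implementation: the field names (poured_mass, target_mass, joint_torques, ee_velocity) and the specific penalty forms are hypothetical placeholders standing in for whatever observations the Isaac Lab environment exposes.

```python
import numpy as np

def pouring_reward(state, action, weights):
    """Hypothetical weighted reward for a precision pouring task.

    `state` and `action` are assumed to be dictionaries exposing the
    poured-mass error, end-effector velocity, and joint torques; none
    of these names or penalty forms come from the thesis itself.
    """
    # Accuracy: penalize deviation between poured and target liquid mass.
    accuracy_term = -abs(state["poured_mass"] - state["target_mass"])

    # Effort: penalize squared joint torques (actuation cost).
    effort_term = -np.sum(np.square(action["joint_torques"]))

    # Velocity: penalize end-effector speed to discourage abrupt motion.
    velocity_term = -np.linalg.norm(state["ee_velocity"])

    return (weights["accuracy"] * accuracy_term
            + weights["effort"] * effort_term
            + weights["velocity"] * velocity_term)


# One possible weight configuration; sweeping such weights is what the
# abstract refers to as evaluating distinct reward configurations.
example_weights = {"accuracy": 1.0, "effort": 0.1, "velocity": 0.05}
```

Under this framing, each of the 25 reward configurations would correspond to a different choice of the weight triple, with the PPO policy trained from scratch on each configuration.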