Multi-expert Preference Alignment in Reinforcement Learning
Abstract
This project explores adaptation to preference shifts in Multi-objective Reinforcement Learning (MORL), focusing on how Reinforcement Learning (RL) agents can align with the preferences of multiple experts. Such alignment may be required across scenarios with distinct expert preferences, or within a single scenario whose preferences shift over time. Traditional RL requires retraining the policy whenever a new expert preference is introduced, which is computationally expensive and impractical. Instead, this project proposes a single-policy RL algorithm, Generalized Preference-based PPO (GPB PPO), which incorporates both environmental information and the experts' preference requirements throughout the decision-making process. By exposing the agent to diverse preference scenarios during training, GPB PPO learns a policy conditioned on the preference and can generalize to any given preference, eliminating the need for explicit retraining or additional adaptation when preferences shift. The generalization and adaptation capabilities of GPB PPO are evaluated in both stationary and non-stationary environments.
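The abstract describes a single policy that is conditioned on an expert preference vector and trained under diverse, randomly sampled preferences. The sketch below illustrates that general idea only; the network layout, the linear scalarization of the multi-objective reward, and the Dirichlet sampling of preferences are illustrative assumptions, not details taken from GPB PPO itself.

```python
import numpy as np
import torch
import torch.nn as nn


class PreferenceConditionedPolicy(nn.Module):
    """Policy that takes both the environment state and a preference weight
    vector over objectives, so one set of parameters can act under any
    preference (illustrative sketch, not the GPB PPO architecture)."""

    def __init__(self, state_dim: int, pref_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + pref_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor, preference: torch.Tensor) -> torch.Tensor:
        # Concatenate state and preference so the action distribution is
        # conditioned on the current expert preference.
        logits = self.net(torch.cat([state, preference], dim=-1))
        return torch.softmax(logits, dim=-1)


def sample_preference(num_objectives: int) -> np.ndarray:
    """Draw a random preference from the simplex (assumed Dirichlet sampling)
    so training covers diverse preference scenarios."""
    return np.random.dirichlet(np.ones(num_objectives))


def scalarize_reward(reward_vector: np.ndarray, preference: np.ndarray) -> float:
    """Assumed linear scalarization of the multi-objective reward under a preference."""
    return float(np.dot(reward_vector, preference))


if __name__ == "__main__":
    policy = PreferenceConditionedPolicy(state_dim=8, pref_dim=3, num_actions=4)
    pref = torch.tensor(sample_preference(3), dtype=torch.float32)
    state = torch.zeros(8)
    action_probs = policy(state, pref)
    print(action_probs)  # action distribution under the sampled preference
```

At deployment, a new or shifted preference is simply passed to the same policy as an input, which is what allows adaptation without retraining in this kind of single-policy MORL setup.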