Multi-expert Preference Alignment in Reinforcement Learning
Abstract
This project explores adaptation to preference shifts in Multi-objective Reinforcement Learning (MORL), focusing on how Reinforcement Learning (RL) agents can align with the preferences of multiple experts. Such alignment may be required across scenarios with distinct expert preferences, or within a single scenario whose preferences shift over time. Traditional RL requires retraining the policy whenever a new expert preference is introduced, which is computationally expensive and impractical. Instead, this project proposes a single-policy RL algorithm, Generalized Preference-based PPO (GPB PPO), which incorporates both environmental information and the experts' preference requirements throughout the decision-making process. By exposing the agent to diverse preference scenarios during training, GPB PPO learns a policy conditioned on the preference and can generalize to any given preference, eliminating the need for explicit retraining or additional adaptation when preferences shift. The generalization and adaptation capabilities of GPB PPO are evaluated in both stationary and non-stationary environments.
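The abstract describes a single policy that is conditioned on an expert preference vector and trained under diverse, randomly sampled preferences. The sketch below illustrates that general idea only; the network layout, the linear scalarization of the multi-objective reward, and the Dirichlet sampling of preferences are illustrative assumptions, not details taken from GPB PPO itself.

```python
import numpy as np
import torch
import torch.nn as nn


class PreferenceConditionedPolicy(nn.Module):
    """Policy that takes both the environment state and a preference weight
    vector over objectives, so one set of parameters can act under any
    preference (illustrative sketch, not the GPB PPO architecture)."""

    def __init__(self, state_dim: int, pref_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + pref_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor, preference: torch.Tensor) -> torch.Tensor:
        # Concatenate state and preference so the action distribution is
        # conditioned on the current expert preference.
        logits = self.net(torch.cat([state, preference], dim=-1))
        return torch.softmax(logits, dim=-1)


def sample_preference(num_objectives: int) -> np.ndarray:
    """Draw a random preference from the simplex (assumed Dirichlet sampling)
    so training covers diverse preference scenarios."""
    return np.random.dirichlet(np.ones(num_objectives))


def scalarize_reward(reward_vector: np.ndarray, preference: np.ndarray) -> float:
    """Assumed linear scalarization of the multi-objective reward under a preference."""
    return float(np.dot(reward_vector, preference))


if __name__ == "__main__":
    policy = PreferenceConditionedPolicy(state_dim=8, pref_dim=3, num_actions=4)
    pref = torch.tensor(sample_preference(3), dtype=torch.float32)
    state = torch.zeros(8)
    action_probs = policy(state, pref)
    print(action_probs)  # action distribution under the sampled preference
```

At deployment, a new or shifted preference is simply passed to the same policy as an input, which is what allows adaptation without retraining in this kind of single-policy MORL setup.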