The Human Factor: Addressing Diversity in Reinforcement Learning from Human Feedback
How can RLHF deal with possibly conflicting feedback?
J. PAEZ FRANCO (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Mone – Mentor (TU Delft - Interactive Intelligence)
Luciano C. Cavalcante Siebert – Mentor (TU Delft - Interactive Intelligence)
Wendelin Böhmer – Graduation committee member (TU Delft - Sequential Decision Making)
Abstract
Reinforcement Learning from Human Feedback (RLHF) is a promising approach to training agents to perform complex tasks by incorporating human feedback. However, the quality and diversity of this feedback can significantly affect the learning process: humans differ widely in their preferences, expertise, and capabilities. This paper investigates how conflicting feedback affects agent performance. We analyse the impact of environment complexity and examine several query selection strategies. Our results show that even minimal conflicting feedback rapidly degrades RLHF performance in simple environments, and that current query selection strategies are ineffective at handling feedback diversity. We conclude that addressing feedback diversity is crucial for RLHF and that alternative reward modelling approaches are needed. Full code is available on GitHub.