Learning a Policy from User Preferences
An Interactive Approach to Multi-Objective Reinforcement Learning
H. Zeng (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Z. Osika – Mentor (TU Delft - Policy Analysis)
P.K. Murukannaiah – Mentor (TU Delft - Interactive Intelligence)
Abstract
Many real-life problems are complex due to their multi-objective nature. Over the past decade, there has been growing research on Multi-Objective Reinforcement Learning (MORL) problems, which capture the complexities of real-life scenarios. Because multiple objectives must be optimized, the majority of MORL methods focus on producing a dense set of solutions called the Pareto Front. Current approaches suffer from two issues: generating a large solution set incurs a high computational cost, and it can still be difficult for the user to find their most preferred solutions within such a large set. In this research, we propose an interactive MORL method in which, at every iteration, the user is asked to select their preferred solution from the current solution set, and the algorithm utilizes this information to steer its learning process towards preference-aligned solutions. This is achieved by bounding the solution space so that the search considers only new policies that outperform the previously user-selected solution within these bounds. We evaluate our method using an artificial user function to simulate preferences, comparing it with non-interactive MORL methods. Metrics for comparing solution quality include the number of learning steps required to converge to a preferred solution and the value achieved on the artificial user function. The results demonstrate that the interactive method provides a dense set of solutions in the user's region of interest, and it tends to converge faster towards the user's preferred solution.
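The interaction scheme described above can be illustrated with a minimal sketch. This is not the thesis implementation: the linear-utility artificial user, the fixed weights, the two-objective value vectors, and the random candidate generation are all simplifying assumptions chosen only to show the loop of eliciting a preference, bounding the search to solutions that dominate the chosen one, and repeating.

```python
import random

def artificial_user(values, weights=(0.7, 0.3)):
    # Artificial user function: a linear utility over the objective values.
    # The weights are hypothetical and stand in for hidden user preferences.
    return sum(w * v for w, v in zip(weights, values))

def dominates(a, b):
    # a Pareto-dominates b: no objective is worse and at least one is better.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def interactive_loop(candidates_per_round=20, rounds=5, seed=0):
    # Value vector of the user's current pick; (0, 0) as a trivial baseline.
    best = (0.0, 0.0)
    rng = random.Random(seed)
    for _ in range(rounds):
        # Stand-in for bounded policy search: propose candidate value
        # vectors, then keep only those that dominate the previous pick.
        pool = [(best[0] + rng.random(), best[1] + rng.random())
                for _ in range(candidates_per_round)]
        pool = [v for v in pool if dominates(v, best)] or [best]
        # Simulated preference elicitation: the artificial user selects
        # their most preferred solution from the current candidate set.
        best = max(pool, key=artificial_user)
    return best
```

In the actual method, the candidate pool would come from learned policies and the bounds would constrain the reinforcement-learning search itself; here random vectors merely make the selection-and-bounding cycle concrete.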