Learning a Policy from User Preferences
An Interactive Approach to Multi-Objective Reinforcement Learning
H. Zeng (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Z. Osika – Mentor (TU Delft - Policy Analysis)
P.K. Murukannaiah – Mentor (TU Delft - Interactive Intelligence)
Abstract
Many real-life problems are complex due to their multi-objective nature. Over the past decade, there has been growing research on Multi-Objective Reinforcement Learning (MORL) problems, which capture the complexities of real-life scenarios. Because multiple objectives must be optimized, the majority of MORL methods focus on producing a dense set of solutions called the Pareto Front. Current approaches suffer from two issues: generating a large solution set incurs a high computational cost, and it can still be difficult for the user to find their most preferred solutions within such a large set. In this research, we propose an interactive MORL method in which, at every iteration, the user is asked to select their preferred solution from the current solution set, and the algorithm utilizes this information to steer its learning process towards preference-aligned solutions. This is achieved by bounding the solution space so that the search considers only new policies that outperform the previously user-selected solution within these bounds. We evaluate our method using an artificial user function to simulate preferences, comparing it with non-interactive MORL methods. Metrics for comparing solution quality include the number of learning steps required to converge to a preferred solution and the value achieved on the artificial user function. The results demonstrate that the interactive method provides a dense set of solutions in the user's region of interest, and it tends to converge faster towards the user's preferred solution.
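The interaction scheme described above can be illustrated with a minimal sketch. This is not the thesis implementation: the linear-utility artificial user, the fixed weights, the two-objective value vectors, and the random candidate generation are all simplifying assumptions chosen only to show the loop of eliciting a preference, bounding the search to solutions that dominate the chosen one, and repeating.

```python
import random

def artificial_user(values, weights=(0.7, 0.3)):
    # Artificial user function: a linear utility over the objective values.
    # The weights are hypothetical and stand in for hidden user preferences.
    return sum(w * v for w, v in zip(weights, values))

def dominates(a, b):
    # a Pareto-dominates b: no objective is worse and at least one is better.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def interactive_loop(candidates_per_round=20, rounds=5, seed=0):
    # Value vector of the user's current pick; (0, 0) as a trivial baseline.
    best = (0.0, 0.0)
    rng = random.Random(seed)
    for _ in range(rounds):
        # Stand-in for bounded policy search: propose candidate value
        # vectors, then keep only those that dominate the previous pick.
        pool = [(best[0] + rng.random(), best[1] + rng.random())
                for _ in range(candidates_per_round)]
        pool = [v for v in pool if dominates(v, best)] or [best]
        # Simulated preference elicitation: the artificial user selects
        # their most preferred solution from the current candidate set.
        best = max(pool, key=artificial_user)
    return best
```

In the actual method, the candidate pool would come from learned policies and the bounds would constrain the reinforcement-learning search itself; here random vectors merely make the selection-and-bounding cycle concrete.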