Patronus - Value Alignment of RL Agents in Text-Based Games Using MFT Profiles

Master Thesis (2026)
Author(s)

Sankalp Sagar (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Enrico Liscio – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Pradeep Murukannaiah – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Antonio Mone – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Chirag Raman – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
02-07-2026
Awarding Institution
Delft University of Technology
Programme
Data Science and Artificial Intelligence Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

AI agents optimized purely for task completion often violate human moral expectations inadvertently. The problem is especially pronounced for reinforcement learning agents operating in text-based environments, where the richness of language and the vast action space allow for numerous harmful yet reward-maximizing paths. Existing approaches to moral alignment commonly represent morality as a single scalar signal, which limits both interpretability and the ability to model diverse moral preferences. We introduce MFT Patronus, a policy-shaping framework based on Moral Foundations Theory (MFT) that represents morality along five dimensions: Care, Fairness, Loyalty, Authority, and Sanctity. This multidimensional representation enables configurable moral profiles, each capturing a different prioritization of the foundations. We evaluate our approach on the Jiminy Cricket benchmark and show that it maintains task performance, substantially reduces immoral behavior compared to the standard baselines, and keeps a net positive balance of moral over immoral actions. Our results further demonstrate that different moral profiles produce distinct behavioral patterns, suggesting that multidimensional moral representations are a promising direction for interpretable and configurable value alignment in reinforcement learning agents.
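
To make the idea of a configurable moral profile concrete, the following is a minimal illustrative sketch of profile-weighted policy shaping. It is not the thesis's implementation: the function name `shape_policy`, the per-foundation violation scores, the `strength` parameter, and the softmax over penalized action values are all assumptions chosen for illustration. The abstract states only that morality is represented along the five foundations and used to shape the policy.

```python
import math

# The five foundations named in the abstract.
FOUNDATIONS = ("care", "fairness", "loyalty", "authority", "sanctity")


def shape_policy(q_values, violation_scores, profile, strength=1.0):
    """Return action probabilities after a profile-weighted moral penalty.

    Hypothetical helper, for illustration only.
    q_values: one task value per candidate action.
    violation_scores: one dict per action, mapping a foundation to a
        violation score in [0, 1] (assumed to come from some classifier).
    profile: dict mapping each foundation to a non-negative weight.
    """
    shaped = []
    for q, scores in zip(q_values, violation_scores):
        # Weight each foundation's violation by the profile's priority.
        penalty = sum(profile.get(f, 0.0) * scores.get(f, 0.0) for f in FOUNDATIONS)
        shaped.append(q - strength * penalty)

    # Numerically stable softmax over the penalized values.
    top = max(shaped)
    exps = [math.exp(s - top) for s in shaped]
    total = sum(exps)
    return [e / total for e in exps]


# Two candidate actions; the first has the higher task value but is
# assumed to violate the Fairness foundation.
q = [2.0, 1.0]
scores = [{"fairness": 1.0}, {}]

care_profile = {"care": 3.0}          # ignores fairness violations
fairness_profile = {"fairness": 3.0}  # penalizes them heavily

p_care = shape_policy(q, scores, care_profile)
p_fair = shape_policy(q, scores, fairness_profile)
```

Under these assumptions, the same violation scores lead to different action preferences depending on the profile: the fairness-weighted profile shifts probability away from the first action, while the care-weighted profile leaves the task-driven preference unchanged.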
