Patronus - Value Alignment of RL Agents in Text-Based Games Using MFT Profiles

Master Thesis (2026)

Author(s)

Sankalp Sagar (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Enrico Liscio – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Pradeep Murukannaiah – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Antonio Mone – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Chirag Raman – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Reinforcement Learning, Moral Foundations Theory, LLM, Value Alignment, Jiminy Cricket, Text-Based Games

To reference this document use

https://resolver.tudelft.nl/uuid:547bc5d8-f9ad-4667-a72e-4383f2fa4ec1

Publication Year

2026

Language

English

Graduation Date

02-07-2026

Awarding Institution

Delft University of Technology

Programme

Data Science and Artificial Intelligence Technology

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

AI agents optimized purely for task completion often inadvertently violate human moral expectations. This is especially pronounced in reinforcement learning agents operating in text-based environments, where the richness of language and the vast action space allow for numerous harmful yet reward-maximizing paths. Existing approaches to moral alignment commonly represent morality as a single scalar signal, limiting both interpretability and the ability to model diverse moral preferences. We introduce MFT Patronus, a policy-shaping framework based on Moral Foundations Theory that represents morality across five dimensions: Care, Fairness, Loyalty, Authority, and Sanctity. This multidimensional representation enables configurable moral profiles that capture different foundational priorities. We evaluate our approach on the Jiminy Cricket benchmark and show that, compared to the standard baselines, it maintains task performance, substantially reduces immoral behavior, and keeps a net positive balance of moral over immoral actions. Our results further demonstrate that different moral profiles produce distinct behavioral patterns, suggesting that multidimensional moral representations are a promising direction for interpretable and configurable value alignment in reinforcement learning agents.
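To make the idea of a configurable moral profile concrete, the following is a minimal illustrative sketch and not the thesis's implementation. It assumes that policy shaping means subtracting a profile-weighted penalty from each candidate action's Q-value, and that some external classifier has already scored each action for violations of each foundation. The names `MoralProfile`, `shape_q_values`, `pick_action`, the `strength` parameter, and all numeric values are hypothetical.

```python
from dataclasses import dataclass

# The five foundations named in the abstract.
FOUNDATIONS = ("care", "fairness", "loyalty", "authority", "sanctity")


@dataclass(frozen=True)
class MoralProfile:
    """Hypothetical profile: one non-negative weight per foundation."""
    weights: dict

    def penalty(self, violation_scores):
        # Weighted sum of per-foundation violation scores (assumed in [0, 1]).
        return sum(
            self.weights.get(f, 0.0) * violation_scores.get(f, 0.0)
            for f in FOUNDATIONS
        )


def shape_q_values(q_values, violations, profile, strength=1.0):
    """Return Q-values lowered by the profile-weighted moral penalty.

    q_values:   action -> task Q-value
    violations: action -> {foundation -> violation score}
    """
    return {
        action: q - strength * profile.penalty(violations.get(action, {}))
        for action, q in q_values.items()
    }


def pick_action(q_values, violations, profile, strength=1.0):
    """Greedy action choice over the shaped Q-values."""
    shaped = shape_q_values(q_values, violations, profile, strength)
    return max(shaped, key=shaped.get)


# Made-up example: two candidate actions in a text game.
q = {"steal lamp": 2.0, "ask for lamp": 1.5}
violations = {"steal lamp": {"fairness": 0.9, "care": 0.1}}

care_profile = MoralProfile({"care": 1.0})
fairness_profile = MoralProfile({"fairness": 1.0})

print(pick_action(q, violations, care_profile))
print(pick_action(q, violations, fairness_profile))
```

In this toy setup the two profiles disagree: the care-weighted profile barely penalizes the theft and still picks it, while the fairness-weighted profile penalizes it enough to prefer asking. This mirrors, in a very simplified form, the abstract's claim that different profiles yield distinct behavior.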

Files

Patronus_Value_Alignment_of_RL... (pdf, 4.92 MB)

License info not available