Sankalp Sankalp Sagar

AI agents optimized purely for task completion often violate human moral expectations inadvertently. The problem is especially pronounced for reinforcement learning agents in text-based environments, where the richness of language and the vast action space open up many harmful yet reward-maximizing paths. Existing approaches to moral alignment commonly represent morality as a single scalar signal, which limits both interpretability and the ability to model diverse moral preferences. We introduce MFT Patronus, a policy-shaping framework based on Moral Foundations Theory that represents morality across five dimensions: Care, Fairness, Loyalty, Authority, and Sanctity. This multidimensional representation enables configurable moral profiles that capture different foundational priorities. We evaluate our approach on the Jiminy Cricket benchmark and show that it maintains task performance while substantially reducing immoral behavior relative to the standard baselines, and that it keeps a net positive balance of moral over immoral actions. Our results further demonstrate that different moral profiles produce distinct behavioral patterns, suggesting that multidimensional moral representations are a promising direction for interpretable and configurable value alignment in reinforcement learning agents. ...
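
To make the idea of a configurable moral profile concrete, the following is a minimal illustrative sketch, not the MFT Patronus implementation. The abstract does not specify the shaping rule, so everything here is an assumption: the names `MoralProfile`, `shape_action_values`, and `choose_action`, the linear penalty, the `strength` parameter, and the toy action values and violation scores are all hypothetical. The sketch only shows how per-foundation weights, as opposed to a single scalar, could steer action selection.

```python
from dataclasses import dataclass

# The five dimensions named in the abstract.
FOUNDATIONS = ("care", "fairness", "loyalty", "authority", "sanctity")


@dataclass(frozen=True)
class MoralProfile:
    """Hypothetical profile: one nonnegative weight per moral foundation."""
    weights: dict

    def penalty(self, violation_scores):
        # Weighted sum of per-foundation violation scores (assumed linear form).
        return sum(
            self.weights.get(f, 0.0) * violation_scores.get(f, 0.0)
            for f in FOUNDATIONS
        )


def shape_action_values(q_values, violations, profile, strength=1.0):
    """Subtract a profile-weighted moral penalty from each action's task value.

    q_values:   action -> task value estimated by the agent (assumed given)
    violations: action -> {foundation: violation score in [0, 1]} (assumed given)
    """
    return {
        action: q - strength * profile.penalty(violations.get(action, {}))
        for action, q in q_values.items()
    }


def choose_action(q_values, violations, profile, strength=1.0):
    """Pick the action with the highest shaped value."""
    shaped = shape_action_values(q_values, violations, profile, strength)
    return max(shaped, key=shaped.get)


# Toy example with made-up numbers.
q_values = {"steal lamp": 2.0, "ask for lamp": 1.5}
violations = {"steal lamp": {"fairness": 0.9, "authority": 0.6}}

balanced = MoralProfile({f: 1.0 for f in FOUNDATIONS})
care_only = MoralProfile({"care": 1.0})

balanced_choice = choose_action(q_values, violations, balanced)
care_only_choice = choose_action(q_values, violations, care_only)
```

Under these assumed numbers, the balanced profile penalizes stealing on the Fairness and Authority dimensions and prefers asking, while the care-only profile assigns no weight to those dimensions and keeps the reward-maximizing action. This mirrors, in miniature, the abstract's claim that different profiles yield distinct behavior, and the per-foundation penalty terms remain inspectable in a way a single scalar would not be.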