NLP and reinforcement learning to generate morally aligned text
How do explainable models perform compared to black-box models?
N. De Leeuw (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Enrico Liscio – Mentor (TU Delft - Interactive Intelligence)
D. Mambelli – Mentor (TU Delft - Interactive Intelligence)
P.K. Murukannaiah – Mentor (TU Delft - Interactive Intelligence)
Jie Yang – Graduation committee member (TU Delft - Web Information Systems)
Abstract
This paper evaluates the performance of an automated explainable model, MoralStrength, at predicting morality, or more precisely Moral Foundations Theory (MFT) traits. MFT is a framework that represents and divides morality into precise, detailed traits. The evaluation is carried out in the Jiminy Cricket environment, a suite of 25 text-based games, and helps us estimate the domain adaptation of MoralStrength as well as its limitations; the explainability of the model helps us understand those limitations. We conclude that MoralStrength performs worse overall than other, optimal models and that its adaptation to the Jiminy Cricket domain has some crucial flaws. These findings nevertheless lead us to reflect on the explainability/accuracy trade-off and where to draw the line, given that explainable models are important for ethical decision-making.