NLP and reinforcement learning to generate morally aligned text

How do explainable models perform compared to black-box models?

Abstract


This paper evaluates the performance of MoralStrength, an automated explainable model, at predicting morality, or more precisely the traits defined by Moral Foundations Theory (MFT). MFT represents and divides morality into a set of precise, detailed traits. The evaluation takes place in the Jiminy Cricket environment, a suite of 25 text-based games, and lets us estimate MoralStrength's domain adaptation as well as its limitations; the model's explainability helps us understand those limitations. We conclude that MoralStrength performs worse overall than other optimal models and that its adaptation to the Jiminy Cricket domain has some crucial flaws. These results nevertheless lead us to reflect on the explainability/accuracy trade-off and where to draw the line, given that explainable models are important for ethical decision-making.