Title: NLP and reinforcement learning to generate morally aligned text: How does explainable models perform compared to black-box models
Author: De Leeuw, Nathaniël (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributors: Liscio, E. (mentor); Mambelli, D. (mentor); Murukannaiah, P.K. (mentor); Yang, J. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science and Engineering
Project: CSE3000 Research Project
Date: 2023-07-03

Abstract: This paper evaluates the performance of an automated explainable model, MoralStrength, at predicting morality, or more precisely Moral Foundations Theory (MFT) traits. MFT is a way to represent and divide morality into precise and detailed traits. The evaluation takes place in the Jiminy Cricket environment, a benchmark composed of 25 text-based games. It lets us estimate the domain adaptation of MoralStrength as well as its limitations, and the explainability of the model helps us understand those limitations. We conclude that MoralStrength performs worse overall than other optimal models and that its adaptation to the Jiminy Cricket domain has some crucial flaws, but this leads us to reflect on the explainability/accuracy trade-off and where to draw the line, given that explainable models are important for ethical decision-making.

Subject: NLP; RL; Ethics
To reference this document use: http://resolver.tudelft.nl/uuid:60166a01-cdbe-40af-8cb7-4e5595d108d3
Part of collection: Student theses
Document type: bachelor thesis
Rights: © 2023 Nathaniël De Leeuw
Files: NathanielDeLeeuwPaper.pdf (PDF, 598.35 KB)