Aligning AI with Human Norms

Multi-Objective Deep Reinforcement Learning with Active Preference Elicitation

Master Thesis (2021)
Author(s)

M. Peschl (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Luciano Cavalcante Siebert – Mentor (TU Delft - Interactive Intelligence)

A. Zgonnikov – Mentor (TU Delft - Human-Robot Interaction)

F.A. Oliehoek – Mentor (TU Delft - Interactive Intelligence)

D. Kurowicka – Graduation committee member (TU Delft - Applied Probability)

Publication Year
2021
Language
English
Copyright
© 2021 Markus Peschl
Graduation Date
08-10-2021
Awarding Institution
Delft University of Technology
Programme
Applied Mathematics
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The field of deep reinforcement learning has seen major successes recently, achieving superhuman performance in discrete games such as Go and the Atari domain, as well as impressive results in continuous robot locomotion tasks. However, correctly specifying human intentions in a reward function is highly challenging; as a result, state-of-the-art methods lack interpretability and may lead to unforeseen societal impacts when deployed in the real world. To tackle this, we propose multi-objective reinforced active learning (MORAL), a novel framework based on inverse reinforcement learning for combining a diverse set of human norms into a single Pareto-optimal policy. We show that, by combining active preference learning with multi-objective decision-making, one can interactively train an agent to trade off a variety of learned norms as well as primary reward functions, thus mitigating negative side effects. Furthermore, we introduce two toy environments, Burning Warehouse and Delivery, which allow us to study the scalability of our approach in both state-space size and reward complexity. We find that mixing expert demonstrations and preferences achieves superior efficiency compared to employing a single type of expert feedback. Finally, we suggest that, unlike previous approaches in the literature, MORAL is able to learn a deep reward model consisting of multiple expert utility functions.
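For illustration only, below is a minimal sketch of the general idea the abstract refers to: several learned reward (norm) functions are combined through a weight vector, and that vector is inferred from pairwise trajectory preferences via a Bradley-Terry-style likelihood. This is not the thesis implementation; all function names, feature shapes, and the simple gradient-ascent update with simplex projection are illustrative assumptions.

# Hypothetical sketch: fitting scalarization weights over multiple learned
# reward functions from pairwise trajectory preferences (Bradley-Terry model).
import numpy as np

def bradley_terry_grad(w, phi_a, phi_b, preferred_a):
    """Gradient of the log-likelihood of one pairwise preference.

    phi_a, phi_b: accumulated reward features of two trajectories
    (one entry per learned reward / norm); preferred_a: True if the
    expert preferred trajectory a over trajectory b.
    """
    diff = phi_a - phi_b if preferred_a else phi_b - phi_a
    p = 1.0 / (1.0 + np.exp(-w @ diff))   # P(preferred trajectory wins)
    return (1.0 - p) * diff               # d/dw of log p

def update_weights(w, preferences, lr=0.1, steps=200):
    """Fit weights from a list of (phi_a, phi_b, preferred_a) tuples."""
    for _ in range(steps):
        grad = sum(bradley_terry_grad(w, *pref) for pref in preferences)
        w = w + lr * grad
        w = np.clip(w, 0.0, None)
        w = w / (w.sum() + 1e-8)          # keep w on the probability simplex
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.ones(3) / 3                    # e.g. primary reward + two learned norms
    # Synthetic preferences favouring the second objective.
    prefs = [(rng.normal(size=3) + np.array([0.0, 1.0, 0.0]),
              rng.normal(size=3), True) for _ in range(20)]
    print(update_weights(w, prefs))

In such a setup, the fitted weight vector scalarizes the multiple learned reward functions into a single objective, which a standard policy-optimization algorithm can then maximize to obtain one Pareto-optimal policy.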
