Investigating Inverse Reinforcement Learning from Human Behavior

Effect of Demonstrations with Temporal Biases on Learning Rewards using Inverse Reinforcement Learning

Bachelor Thesis (2023)
Author(s)

M. Zatezalo (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Luciano Cavalcante Siebert – Coach (TU Delft - Interactive Intelligence)

A. Caregnato Neto – Mentor (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Mateja Zatezalo
Publication Year
2023
Language
English
Graduation Date
25-06-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Inverse Reinforcement Learning (IRL) is a machine learning technique for learning reward functions from the behavior of an expert agent. With complex agents, such as humans, the maximized reward may not be easily retrievable, because humans are prone to cognitive biases: deviations from rationality that affect everyday human decision-making. Time-inconsistent decision-making is a temporal cognitive bias in which plans for future actions change depending on the point in time at which they are made. Existing research in this field explores the use of IRL algorithms in numerous real-life situations, but few works examine the effects of temporal biases on the recovered reward function. Hence, in this research, we propose a methodology for generating synthetic demonstrations that emulate human data exhibiting such biases. An existing method, the Maximum Entropy IRL (MEIRL) algorithm, is used to recover reward functions from expert models containing the aforementioned biases, and their performance is compared to that of unbiased models. The demonstrations take the form of trajectories in a Markov Decision Process (MDP), implemented in a GridWorld environment. Temporal biases are implemented within the expert demonstrations as different types of agents, each portraying a specific behavior. Our findings show that all biases affect reward learning to a considerable extent, with the magnitude of the effect varying across comparisons.
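
For readers unfamiliar with the setup, the sketch below illustrates one common way a temporally biased demonstrator can be simulated: a naive agent that plans with hyperbolic discounting and replans at every step, so its preference between a small nearby reward and a large distant reward reverses over time. This is an illustrative sketch, not the code from the thesis; the corridor layout, reward values, horizon, and discount parameter k are all assumptions made for the example. MEIRL (Ziebart et al., 2008) would then be applied to trajectories generated this way, recovering a reward function by matching feature expectations between the demonstrations and the learned policy.

import numpy as np

# Corridor "GridWorld": the agent starts at cell 0 and can move left,
# stay, or move right. A small reward sits close to the start and a
# large reward at the far end. All constants are illustrative
# assumptions, not values from the paper.
N_STATES = 10
SMALL_STATE, SMALL_REWARD = 2, 0.4
LARGE_STATE, LARGE_REWARD = 9, 1.0
HORIZON = 20
ACTIONS = (-1, 0, 1)  # left, stay, right

def reward(s):
    if s == SMALL_STATE:
        return SMALL_REWARD
    if s == LARGE_STATE:
        return LARGE_REWARD
    return 0.0

def step(s, a):
    # Deterministic transition; moves off the ends of the corridor stay put.
    return min(max(s + a, 0), N_STATES - 1)

def plan_first_action(s0, k):
    """Backward induction over the horizon, weighting a reward received
    t steps from now by the hyperbolic factor 1 / (1 + k*t). Returns the
    first action of the resulting plan (k = 0 recovers an unbiased,
    undiscounted planner)."""
    V = np.zeros(N_STATES)              # value beyond the horizon
    first = np.zeros(N_STATES, dtype=int)
    for t in reversed(range(HORIZON)):
        w = 1.0 / (1.0 + k * t)         # hyperbolic weight at delay t
        V_new = np.empty(N_STATES)
        for s in range(N_STATES):
            q = [w * reward(step(s, a)) + V[step(s, a)] for a in ACTIONS]
            best = int(np.argmax(q))
            V_new[s] = q[best]
            if t == 0:
                first[s] = ACTIONS[best]
        V = V_new
    return first[s0]

def generate_demonstration(k, start=0):
    """A *naive* time-inconsistent demonstrator: it replans from scratch
    at every step, re-anchoring the hyperbolic weights to the present,
    which is what makes its behavior deviate from its own earlier plan."""
    trajectory, s = [], start
    for _ in range(HORIZON):
        a = plan_first_action(s, k)
        trajectory.append((s, a))
        s = step(s, a)
    return trajectory

if __name__ == "__main__":
    for k, label in [(0.0, "unbiased"), (1.5, "hyperbolic")]:
        visited = [s for s, _ in generate_demonstration(k)]
        print(f"{label:>10}: {visited}")

Running the script shows the unbiased planner (k = 0) walking past the small reward to the distant large one, while the hyperbolic agent stops and stays at the nearby reward. Feeding both sets of trajectories to an IRL algorithm makes it possible to compare the recovered reward functions, which is the kind of biased-versus-unbiased comparison the thesis performs.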

Files

RP_Research_Paper_Mateja_Zatez... (pdf | 1.05 MB)
Embargo expired on 05-07-2023
License info not available