Investigating Inverse Reinforcement Learning from Human Behavior

Effect of Demonstrations with Temporal Biases on Learning Rewards using Inverse Reinforcement Learning

Bachelor Thesis (2023)
Author(s)

M. Zatezalo (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Luciano Cavalcante Siebert – Coach (TU Delft - Interactive Intelligence)

A. Caregnato Neto – Mentor (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Mateja Zatezalo
Publication Year
2023
Language
English
Graduation Date
25-06-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Inverse Reinforcement Learning (IRL) is a machine learning technique for learning reward functions from the behavior of an expert agent. With complex agents, such as humans, the maximized reward may not be easily retrievable, because humans are prone to cognitive biases: deviations from rationality that affect everyday human decision-making. Time-inconsistent decision-making is a temporal cognitive bias in which plans for future actions change depending on the point in time at which they are made. Existing research in this field explores the use of IRL algorithms in numerous real-life situations, but few works examine the effects of temporal biases on the recovered reward function. Hence, in this research, we propose a methodology for generating synthetic demonstrations that emulate human data exhibiting such biases. An existing method, the Maximum Entropy IRL (MEIRL) algorithm, is used to recover reward functions from expert models containing the aforementioned biases, and their performance is compared to that of unbiased models. The demonstrations take the form of trajectories in a Markov Decision Process (MDP), implemented in a GridWorld environment. Temporal biases are implemented within the expert demonstrations as different types of agents, each portraying a specific behavior. Our findings show that all biases affect reward learning to a considerable extent, with the magnitude of the effect varying across comparisons.
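
For readers unfamiliar with the setup, the sketch below illustrates one common way a temporally biased demonstrator can be simulated: a naive agent that plans with hyperbolic discounting and replans at every step, so its preference between a small nearby reward and a large distant reward reverses over time. This is an illustrative sketch, not the code from the thesis; the corridor layout, reward values, horizon, and discount parameter k are all assumptions made for the example. MEIRL (Ziebart et al., 2008) would then be applied to trajectories generated this way, recovering a reward function by matching feature expectations between the demonstrations and the learned policy.

import numpy as np

# Corridor "GridWorld": the agent starts at cell 0 and can move left,
# stay, or move right. A small reward sits close to the start and a
# large reward at the far end. All constants are illustrative
# assumptions, not values from the paper.
N_STATES = 10
SMALL_STATE, SMALL_REWARD = 2, 0.4
LARGE_STATE, LARGE_REWARD = 9, 1.0
HORIZON = 20
ACTIONS = (-1, 0, 1)  # left, stay, right

def reward(s):
    if s == SMALL_STATE:
        return SMALL_REWARD
    if s == LARGE_STATE:
        return LARGE_REWARD
    return 0.0

def step(s, a):
    # Deterministic transition; moves off the ends of the corridor stay put.
    return min(max(s + a, 0), N_STATES - 1)

def plan_first_action(s0, k):
    """Backward induction over the horizon, weighting a reward received
    t steps from now by the hyperbolic factor 1 / (1 + k*t). Returns the
    first action of the resulting plan (k = 0 recovers an unbiased,
    undiscounted planner)."""
    V = np.zeros(N_STATES)              # value beyond the horizon
    first = np.zeros(N_STATES, dtype=int)
    for t in reversed(range(HORIZON)):
        w = 1.0 / (1.0 + k * t)         # hyperbolic weight at delay t
        V_new = np.empty(N_STATES)
        for s in range(N_STATES):
            q = [w * reward(step(s, a)) + V[step(s, a)] for a in ACTIONS]
            best = int(np.argmax(q))
            V_new[s] = q[best]
            if t == 0:
                first[s] = ACTIONS[best]
        V = V_new
    return first[s0]

def generate_demonstration(k, start=0):
    """A *naive* time-inconsistent demonstrator: it replans from scratch
    at every step, re-anchoring the hyperbolic weights to the present,
    which is what makes its behavior deviate from its own earlier plan."""
    trajectory, s = [], start
    for _ in range(HORIZON):
        a = plan_first_action(s, k)
        trajectory.append((s, a))
        s = step(s, a)
    return trajectory

if __name__ == "__main__":
    for k, label in [(0.0, "unbiased"), (1.5, "hyperbolic")]:
        visited = [s for s, _ in generate_demonstration(k)]
        print(f"{label:>10}: {visited}")

Running the script shows the unbiased planner (k = 0) walking past the small reward to the distant large one, while the hyperbolic agent stops and stays at the nearby reward. Feeding both sets of trajectories to an IRL algorithm makes it possible to compare the recovered reward functions, which is the kind of biased-versus-unbiased comparison the thesis performs.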

Files

RP_Research_Paper_Mateja_Zatez... (pdf | 1.05 MB)
Embargo expired on 05-07-2023
License info not available