Conflict in the World of Inverse Reinforcement Learning
Investigating Inverse Reinforcement Learning with Conflicting Demonstrations
P. Koev (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Mone – Mentor (TU Delft - Interactive Intelligence)
Luciano C. Cavalcante Siebert – Mentor (TU Delft - Interactive Intelligence)
Wendelin Böhmer – Graduation committee member (TU Delft - Sequential Decision Making)
Abstract
Inverse Reinforcement Learning (IRL) is closely related to Reinforcement Learning (RL), but instead of learning a policy from a given reward function, it aims to recover the reward function from a set of expert demonstrations. Many IRL algorithms have been proposed, but most assume consistent demonstrations: all demonstrations are assumed to follow the same underlying reward function and a near-optimal policy, without contradictions. In practice, this is not always the case. This study investigates the effect of conflicting demonstrations on IRL algorithms. For our experiments, the Lunar Lander environment and a grid-world environment are used in combination with a state-of-the-art IRL algorithm. To obtain the expert demonstrations, agents were trained with RL under explicitly different reward functions until each reached an optimal policy. These demonstrations were then used to train the IRL algorithm under a variety of hyperparameter configurations. Our results show that IRL algorithms can be trained on demonstrations with varying levels of conflict, and that IRL can still learn even when the demonstration set is conflicting.
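The pipeline described in the abstract can be sketched as follows. The snippet below is a minimal illustration under stated assumptions, not the thesis code: it trains two PPO experts on Lunar Lander under deliberately different reward shaping and pools their trajectories into a single conflicting demonstration set, which is the kind of input one would then hand to an IRL algorithm. The ShapedLander wrapper, the side-preference shaping term, and the coefficient 0.3 are hypothetical stand-ins for whatever explicit reward differences the study actually used.

```python
"""Sketch: build a conflicting demonstration set for Lunar Lander by training
two RL experts with different reward shaping and pooling their trajectories."""
import gymnasium as gym
from stable_baselines3 import PPO


class ShapedLander(gym.Wrapper):
    """Hypothetical shaping: bonus for drifting toward one side, so the two
    experts end up preferring contradictory landing behaviour."""

    def __init__(self, env, side: float):
        super().__init__(env)
        self.side = side  # +1.0 favours the right side, -1.0 the left side

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        reward += 0.3 * self.side * obs[0]  # obs[0] is the horizontal position
        return obs, reward, terminated, truncated, info


def train_expert(side: float, steps: int = 100_000) -> PPO:
    # Environment id may be "LunarLander-v3" on newer gymnasium releases.
    env = ShapedLander(gym.make("LunarLander-v2"), side)
    model = PPO("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=steps)
    return model


def collect_demos(model: PPO, n_episodes: int = 20):
    """Roll out a trained expert and record (state, action) pairs."""
    env = gym.make("LunarLander-v2")  # demonstrations use the unshaped env
    demos = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            action = int(action)
            demos.append((obs.copy(), action))
            obs, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
    return demos


if __name__ == "__main__":
    left_expert = train_expert(side=-1.0)
    right_expert = train_expert(side=+1.0)
    # Pooling the two experts' trajectories yields a conflicting demonstration
    # set: the same states are paired with contradictory expert actions.
    conflicting_demos = collect_demos(left_expert) + collect_demos(right_expert)
    print(f"collected {len(conflicting_demos)} conflicting state-action pairs")
```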