REAL Reinforcement Learning

Planning with adversarial models

Master Thesis (2022)
Authors

D. Foffano (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Supervisors

Frans A Oliehoek (TU Delft - Interactive Intelligence)

Jinke He (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2022 Daniele Foffano
Publication Year
2022
Language
English
Graduation Date
14-01-2022
Awarding Institution
Delft University of Technology
Programme
Computer Science | Data Science and Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Model-Based Reinforcement Learning (MBRL) algorithms solve sequential decision-making problems, usually formalised as Markov Decision Processes, by using a model of the environment dynamics to compute the optimal policy. When dealing with complex environments, the environment dynamics are frequently approximated with function approximators (such as Neural Networks) that are not guaranteed to converge to an optimal solution. As a consequence, planning with samples generated by an imperfect model is also not guaranteed to converge to the optimal policy. In fact, the mismatch between the source and target dynamics distributions can result in compounding errors, leading to poor performance at test time. To mitigate this, we combine the Robust Markov Decision Processes (RMDPs) framework with an ensemble of models to account for the uncertainty in the approximation of the dynamics. With RMDPs, we can study the uncertainty problem as a two-player stochastic game in which Player 1 aims to maximize the expected return and Player 2 aims to minimize it. Using an ensemble of models, Player 2 can choose the worst model to carry out the transitions when performing rollouts for policy improvement. We show experimentally that finding a maximin strategy for this game yields a policy that is robust to model errors, leading to better performance than assuming the learned dynamics to be correct.
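To make the worst-case rollout idea in the abstract concrete, the following is a minimal, hypothetical Python sketch (not taken from the thesis). It assumes a policy callable, an ensemble of learned dynamics models each exposing a step(state, action) -> (next_state, reward) interface, and a value_fn value estimate; these names and interfaces are illustrative assumptions. At every step the adversary (Player 2) selects the ensemble member whose predicted transition gives Player 1 the lowest estimated return.

def pessimistic_rollout(state, policy, ensemble, value_fn, horizon, gamma=0.99):
    # Illustrative sketch, not the thesis code: the adversary (Player 2)
    # picks, at every step, the ensemble member whose predicted transition
    # gives the lowest estimated return for the agent (Player 1).
    total_return, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        # Each learned model proposes a (next_state, reward) for (state, action).
        candidates = [model.step(state, action) for model in ensemble]
        # Player 2 chooses the transition minimizing reward + gamma * V(next_state).
        next_state, reward = min(
            candidates, key=lambda sr: sr[1] + gamma * value_fn(sr[0])
        )
        total_return += discount * reward
        discount *= gamma
        state = next_state
    return total_return

Improving Player 1's policy against returns computed from such pessimistic rollouts corresponds to seeking the maximin strategy described in the abstract.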

Files

MSc_Thesis_Foffano.pdf
(pdf | 3.09 Mb)
License info not available