Sample-efficient multi-agent reinforcement learning using learned world models

Master Thesis (2021)
Author(s)

J.D. Willemsen (TU Delft - Aerospace Engineering)

Contributor(s)

M. Coppola – Mentor (TU Delft - Control & Simulation)

G. C. H. E. de Croon – Mentor (TU Delft - Control & Simulation)

Faculty
Aerospace Engineering
Copyright
© 2021 Daniël Willemsen
Publication Year
2021
Language
English
Graduation Date
28-01-2021
Awarding Institution
Delft University of Technology
Programme
Aerospace Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Multi-agent robotic systems could benefit from reinforcement learning algorithms that are able to learn behaviours in a small number of trials, a property known as sample efficiency. This research investigates the use of learned world models to create more sample-efficient algorithms. We present a novel multi-agent model-based reinforcement learning algorithm, Multi-Agent Model-Based Policy Optimization (MAMBPO), which utilizes the Centralized Learning for Decentralized Execution (CLDE) framework, and demonstrate state-of-the-art sample efficiency on a number of benchmark domains. CLDE algorithms allow a group of agents to act in a fully decentralized manner after training. This is a desirable property for many systems comprising multiple robots. Current CLDE algorithms such as Multi-Agent Soft Actor-Critic (MASAC) suffer from limited sample efficiency, often taking many thousands of trials before learning desirable behaviours. This makes them impractical for learning in real-world robotic tasks. MAMBPO utilizes a learned world model to improve sample efficiency compared to its model-free counterparts. We demonstrate on two simulated multi-agent robotics tasks that MAMBPO reaches similar performance to MASAC with up to 3.7 times fewer samples required for learning. In doing so, we take an important step towards making real-life learning for multi-agent robotic systems possible.
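The core mechanism the abstract describes — augmenting scarce real experience with cheap synthetic transitions from a learned world model before running policy updates — can be sketched as follows. This is a minimal toy illustration of the model-based data-augmentation idea, not the thesis implementation: the environment, class names, and the memorizing "world model" are all illustrative assumptions.

```python
import random

class ToyEnv:
    """Hypothetical 1-D chain: integer state, actions move left/right, reward at s=5."""
    def __init__(self):
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(5, self.s + (1 if a == 1 else -1)))
        return self.s, float(self.s == 5)

class WorldModel:
    """Stand-in learned dynamics model: memorizes observed (s, a) -> (s', r)."""
    def __init__(self):
        self.table = {}
    def fit(self, transitions):
        for s, a, s2, r in transitions:
            self.table[(s, a)] = (s2, r)
    def predict(self, s, a):
        return self.table.get((s, a))  # None for unseen state-action pairs

real_buffer, model_buffer = [], []
env, model = ToyEnv(), WorldModel()
rng = random.Random(0)

# 1) Collect a small amount of real (expensive) experience.
s = env.reset()
for _ in range(50):
    a = rng.choice([0, 1])
    s2, r = env.step(a)
    real_buffer.append((s, a, s2, r))
    s = s2

# 2) Fit the world model on the real transitions.
model.fit(real_buffer)

# 3) Generate cheap synthetic transitions from the model; a policy-optimization
#    step (e.g. soft actor-critic updates) would then train on both buffers.
for s, a, _, _ in real_buffer:
    pred = model.predict(s, a)
    if pred is not None:
        model_buffer.append((s, a, pred[0], pred[1]))

print(len(real_buffer), len(model_buffer))
```

Because policy updates draw on both buffers, each real environment trial yields more gradient steps, which is the source of the sample-efficiency gain the abstract reports.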
