Sample-efficient multi-agent reinforcement learning using learned world models

Master Thesis (2021)
Author(s)

J.D. Willemsen (TU Delft - Aerospace Engineering)

Contributor(s)

M. Coppola – Mentor (TU Delft - Control & Simulation)

G. C. H. E. de Croon – Mentor (TU Delft - Control & Simulation)

Faculty
Aerospace Engineering
Copyright
© 2021 Daniël Willemsen
Publication Year
2021
Language
English
Graduation Date
28-01-2021
Awarding Institution
Delft University of Technology
Programme
Aerospace Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Multi-agent robotic systems could benefit from reinforcement learning algorithms that are able to learn behaviours in a small number of trials, a property known as sample efficiency. This research investigates the use of learned world models to create more sample-efficient algorithms. We present a novel multi-agent model-based reinforcement learning algorithm, Multi-Agent Model-Based Policy Optimization (MAMBPO), which utilizes the Centralized Learning for Decentralized Execution (CLDE) framework, and demonstrate state-of-the-art sample efficiency on a number of benchmark domains. CLDE algorithms allow a group of agents to act in a fully decentralized manner after training. This is a desirable property for many systems comprising multiple robots. Current CLDE algorithms such as Multi-Agent Soft Actor-Critic (MASAC) suffer from limited sample efficiency, often taking many thousands of trials before learning desirable behaviours. This makes them impractical for learning in real-world robotic tasks. MAMBPO utilizes a learned world model to improve sample efficiency compared to its model-free counterparts. We demonstrate on two simulated multi-agent robotics tasks that MAMBPO reaches similar performance to MASAC with up to 3.7 times fewer samples required for learning. In doing so, we take an important step towards making real-life learning for multi-agent robotic systems possible.
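The core mechanism the abstract describes — augmenting scarce real experience with cheap synthetic transitions from a learned world model before running policy updates — can be sketched as follows. This is a minimal toy illustration of the model-based data-augmentation idea, not the thesis implementation: the environment, class names, and the memorizing "world model" are all illustrative assumptions.

```python
import random

class ToyEnv:
    """Hypothetical 1-D chain: integer state, actions move left/right, reward at s=5."""
    def __init__(self):
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(5, self.s + (1 if a == 1 else -1)))
        return self.s, float(self.s == 5)

class WorldModel:
    """Stand-in learned dynamics model: memorizes observed (s, a) -> (s', r)."""
    def __init__(self):
        self.table = {}
    def fit(self, transitions):
        for s, a, s2, r in transitions:
            self.table[(s, a)] = (s2, r)
    def predict(self, s, a):
        return self.table.get((s, a))  # None for unseen state-action pairs

real_buffer, model_buffer = [], []
env, model = ToyEnv(), WorldModel()
rng = random.Random(0)

# 1) Collect a small amount of real (expensive) experience.
s = env.reset()
for _ in range(50):
    a = rng.choice([0, 1])
    s2, r = env.step(a)
    real_buffer.append((s, a, s2, r))
    s = s2

# 2) Fit the world model on the real transitions.
model.fit(real_buffer)

# 3) Generate cheap synthetic transitions from the model; a policy-optimization
#    step (e.g. soft actor-critic updates) would then train on both buffers.
for s, a, _, _ in real_buffer:
    pred = model.predict(s, a)
    if pred is not None:
        model_buffer.append((s, a, pred[0], pred[1]))

print(len(real_buffer), len(model_buffer))
```

Because policy updates draw on both buffers, each real environment trial yields more gradient steps, which is the source of the sample-efficiency gain the abstract reports.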
