Difference rewards policy gradients

Journal Article (2022)

Author(s)

Jacopo Castellini (University of Liverpool)

Sam Devlin (Microsoft Research Cambridge)

Frans Oliehoek (TU Delft - Interactive Intelligence)

Rahul Savani (University of Liverpool)

Research Group

Interactive Intelligence

DOI related publication

https://doi.org/10.1007/s00521-022-07960-5

Multi-agent reinforcement learning; Difference rewards; Multi-agent credit assignment; Policy gradients; Reward learning

To reference this document use:

https://resolver.tudelft.nl/uuid:b47d136e-6080-4876-a299-43c68d7ff46e

Publication Year

2022

Language

English

Issue number

19

Volume number

37

Pages (from-to)

13163-13186

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Policy gradient methods have become one of the most popular classes of algorithms for multi-agent reinforcement learning. Many of these methods, however, do not address a key challenge: multi-agent credit assignment, that is, assessing an agent’s contribution to the overall performance, which is crucial for learning good policies. We propose a novel algorithm called Dr.Reinforce that explicitly tackles this challenge by combining difference rewards with policy gradients, allowing decentralized policies to be learned when the reward function is known. By differencing the reward function directly, Dr.Reinforce avoids the difficulties associated with learning the Q-function, as done by counterfactual multi-agent policy gradients (COMA), a state-of-the-art difference rewards method. For applications where the reward function is unknown, we show the effectiveness of a version of Dr.Reinforce that learns an additional reward network, which is used to estimate the difference rewards.
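
As a rough illustration of what "differencing the reward function directly" can look like, the Python sketch below computes a difference reward for a single agent when the reward function is known. It is a minimal sketch under stated assumptions, not the Dr.Reinforce implementation. It assumes one common form of difference reward, in which the baseline is the expected team reward when only that agent's action is resampled from its own policy. The names `difference_reward` and `toy_reward`, the three-agent setup, and the uniform policy are all hypothetical and chosen only for the example.

```python
from typing import Callable, Hashable, Sequence, Tuple

JointAction = Tuple[int, ...]


def difference_reward(
    reward_fn: Callable[[Hashable, JointAction], float],
    state: Hashable,
    joint_action: JointAction,
    agent: int,
    policy_probs: Sequence[float],
) -> float:
    """Team reward minus a counterfactual baseline for one agent.

    Assumed form: the baseline is the expected team reward when `agent`'s
    action is replaced by each alternative, weighted by the agent's own
    policy probabilities, while the other agents' actions stay fixed.
    """
    actual = reward_fn(state, joint_action)
    baseline = 0.0
    for alt_action, prob in enumerate(policy_probs):
        # Swap in the alternative action for this agent only.
        counterfactual = (
            joint_action[:agent] + (alt_action,) + joint_action[agent + 1:]
        )
        baseline += prob * reward_fn(state, counterfactual)
    return actual - baseline


def toy_reward(state: Hashable, joint_action: JointAction) -> float:
    # Hypothetical known team reward: the number of agents choosing action 1.
    return float(sum(1 for a in joint_action if a == 1))


joint = (1, 0, 1)      # three agents, two actions each (0 or 1)
uniform = [0.5, 0.5]   # assumed current policy of the agent being evaluated

d0 = difference_reward(toy_reward, None, joint, agent=0, policy_probs=uniform)
d1 = difference_reward(toy_reward, None, joint, agent=1, policy_probs=uniform)
print(d0, d1)  # 0.5 -0.5
```

In this toy setting, agent 0 chose the helpful action and receives a positive signal, while agent 1 did not and receives a negative one, even though both observe the same team reward of 2. In a policy gradient method, a per-agent signal of this kind would take the place of the shared team reward when weighting each agent's log-probability gradient.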

Files

S00521_022_07960_5.pdf

(pdf | 16 Mb)