Exploring the Effects of Conditioning Independent Q-Learners on the Sufficient Statistic for Dec-POMDPs

Conference Paper (2020)
Author(s)

A.V. Mandersloot (TU Delft - Interactive Intelligence)

F.A. Oliehoek (TU Delft - Interactive Intelligence)

Aleksander Czechowski (TU Delft - Interactive Intelligence)

Research Group
Interactive Intelligence
Copyright
© 2020 A.V. Mandersloot, F.A. Oliehoek, A.T. Czechowski
Publication Year
2020
Language
English
Pages (from-to)
423-424
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In this study, we investigate the effects of conditioning Independent Q-Learners (IQL) not solely on their individual action-observation histories, but additionally on the sufficient plan-time statistic for Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs). In doing so, we attempt to address a key shortcoming of IQL, namely that it is likely to converge to a Nash equilibrium that can be arbitrarily poor. We identify a novel exploration strategy for IQL when it conditions on the sufficient statistic, and furthermore show that sub-optimal equilibria can be escaped consistently by sequencing the decision-making during learning. The practical limitation of the approach is the exponential growth of both the sufficient statistic and the decision rules.
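The conditioning idea described above can be illustrated with a minimal sketch (not the authors' implementation): a tabular independent Q-learner whose Q-table is keyed on the pair of a hashable plan-time statistic and the agent's private action-observation history, rather than on the history alone. All class and parameter names below are hypothetical choices for illustration.

```python
import random
from collections import defaultdict

class StatisticConditionedIQL:
    """Sketch of an independent Q-learner conditioned on a plan-time
    statistic in addition to its own action-observation history."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        # Q[(statistic, history)][action]; both keys must be hashable,
        # e.g. the history as a tuple of (action, observation) pairs.
        self.Q = defaultdict(lambda: {a: 0.0 for a in self.actions})

    def act(self, statistic, history):
        # Epsilon-greedy action selection over the conditioned table.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        q = self.Q[(statistic, history)]
        return max(q, key=q.get)

    def update(self, statistic, history, action, reward,
               next_statistic, next_history):
        # Standard one-step Q-learning backup, applied per agent on the
        # (statistic, history) key instead of the history alone.
        q = self.Q[(statistic, history)]
        next_q = self.Q[(next_statistic, next_history)]
        target = reward + self.gamma * max(next_q.values())
        q[action] += self.alpha * (target - q[action])
```

Note that this sketch makes the abstract's practical limitation concrete: the number of `(statistic, history)` keys grows exponentially with the horizon, since the sufficient statistic is a distribution over joint histories.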

Files

Bnaic2020proceedings03.pdf
(PDF | 0.672 MB)