Exploring the effects of conditioning Independent Q-Learners on the sufficient plan-time statistic for Dec-POMDPs
Abstract
The Decentralized Partially Observable Markov Decision Process (Dec-POMDP) is a commonly used framework to formally model scenarios in which multiple agents must collaborate using local information. A key difficulty in a Dec-POMDP is that, in order to coordinate successfully, an agent must decide on actions not only using its own information, but also by reasoning about the information available to the other agents. However, existing value-based Reinforcement Learning techniques for Dec-POMDPs typically take the individual perspective, under which each agent optimizes its own actions using solely its local information, thereby essentially neglecting the presence of the others. As a result, the concatenation of individual policies learned in this way often results in a sub-optimal joint policy. In this work, we propose to additionally condition such Independent Q-Learners on the plan-time sufficient statistic for Dec-POMDPs, which contains a distribution over joint action-observation histories. Using this statistic, the agents can accurately reason about the actions the other agents will take, and adjust their own behavior accordingly. Our main contributions are threefold. (1) We thoroughly investigate the effects of conditioning Independent Q-Learners on the sufficient statistic for Dec-POMDPs. (2) We identify novel exploration strategies that the agents can follow by conditioning on the sufficient statistic, as well as their implications for the decision rules, the sufficient statistic, and the learning process. (3) We demonstrate that, by conceptually sequencing the decision-making and additionally conditioning agents on the current decision rules of the earlier agents, such learners consistently escape sub-optimal equilibria and learn the optimal policy in our test environment, Dec-Tiger.
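To make the central idea concrete, the following is a minimal, illustrative Python sketch (not the authors' implementation) of a tabular Independent Q-Learner whose Q-values are conditioned on both its local action-observation history and a hashable summary of the plan-time sufficient statistic. All names (e.g. ConditionedIQL, _stat_key) are hypothetical, and the statistic is assumed to be supplied externally as a dictionary mapping joint action-observation histories to probabilities.

```python
import random
from collections import defaultdict


class ConditionedIQL:
    """Tabular independent Q-learner conditioned on
    (local action-observation history, plan-time sufficient statistic).

    Illustrative sketch only: the sufficient statistic is assumed to be a
    dict {joint_history: probability} with hashable, comparable keys
    (e.g. tuples of action/observation labels), and is discretised into a
    hashable key so that situations inducing (almost) the same statistic
    share Q-values.
    """

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # maps (conditioned key, action) -> Q-value

    @staticmethod
    def _stat_key(statistic, precision=2):
        """Discretise the distribution over joint histories into a hashable key."""
        return tuple(sorted((h, round(p, precision)) for h, p in statistic.items()))

    def _key(self, local_history, statistic):
        return (tuple(local_history), self._stat_key(statistic))

    def select_action(self, local_history, statistic):
        """Epsilon-greedy action selection on the conditioned key."""
        key = self._key(local_history, statistic)
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(key, a)])

    def update(self, local_history, statistic, action, reward,
               next_local_history, next_statistic, done):
        """Standard Q-learning backup applied to the conditioned key."""
        key = self._key(local_history, statistic)
        next_key = self._key(next_local_history, next_statistic)
        best_next = 0.0 if done else max(self.q[(next_key, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(key, action)] += self.alpha * (target - self.q[(key, action)])
```

In this sketch, the conditioning only changes what the Q-table is indexed by; how the sufficient statistic itself is maintained (e.g., updated from the agents' past decision rules), and how the sequenced decision-making and conditioning on earlier agents' current decision rules are realised, are the subject of the work summarised above.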