Title
Exploring the effects of conditioning Independent Q-Learners on the sufficient plan-time statistic for Dec-POMDPs

Author
Mandersloot, A.V. (TU Delft Electrical Engineering, Mathematics and Computer Science)

Contributor
Oliehoek, F.A. (mentor)
Czechowski, A.T. (graduation committee)
Jonker, C.M. (graduation committee)
de Weerdt, M.M. (graduation committee)

Degree granting institution
Delft University of Technology

Programme
Computer Science

Date
2020-08-17

Abstract
The Decentralized Partially Observable Markov Decision Process (Dec-POMDP) is a commonly used framework to formally model scenarios in which multiple agents must collaborate using local information. A key difficulty in a Dec-POMDP is that, in order to coordinate successfully, an agent must decide on actions not only using its own information, but also by reasoning about the information available to the other agents. Nevertheless, existing value-based Reinforcement Learning techniques for Dec-POMDPs typically take the individual perspective, under which each agent optimizes its own actions using solely its local information, essentially neglecting the presence of the others. As a result, the concatenation of individual policies learned in this way tends to yield a sub-optimal joint policy. In this work, we propose to additionally condition such Independent Q-Learners on the plan-time sufficient statistic for Dec-POMDPs, which contains a distribution over the joint action-observation history. Using this statistic, the agents can accurately reason about the actions the other agents will take, and adjust their own behavior accordingly. Our main contributions are threefold. (1) We thoroughly investigate the effects of conditioning Independent Q-Learners on the sufficient statistic for Dec-POMDPs. (2) We identify novel exploration strategies that the agents can follow by conditioning on the sufficient statistic, as well as their implications for the decision rules, the sufficient statistic and the learning process. (3) We substantiate and demonstrate that, by conceptually sequencing the decision-making and additionally conditioning the agents on the current decision rules of the earlier agents, such learners are able to consistently escape sub-optimal equilibria and learn the optimal policy in our test environment, Dec-Tiger.

Subject
Deep Reinforcement Learning, Independent Q-Learning, Partial Observability, Dec-POMDP, Multi-agent

To reference this document use
http://resolver.tudelft.nl/uuid:eba94071-5cfa-4132-93a8-5947fccdd731

Part of collection
Student theses

Document type
master thesis

Rights
© 2020 A.V. Mandersloot

Files
Thesis_Alex_Mandersloot.pdf (PDF, 15.57 MB)
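
As an informal illustration of the conditioning described in the abstract, the sketch below shows a tabular independent Q-learner whose Q-values are indexed by both the agent's local action-observation history and a discretised plan-time sufficient statistic (a distribution over joint action-observation histories). The class name, the rounding-based discretisation, and all hyperparameters are illustrative assumptions made for this example; the thesis itself uses deep Q-learners, so this is a minimal conceptual sketch rather than the method as implemented.

```python
# Illustrative sketch only: a tabular independent Q-learner that conditions
# on (local history, plan-time sufficient statistic). All names and the
# discretisation scheme are assumptions made for this example.
from collections import defaultdict
import random


class StatisticConditionedQLearner:
    def __init__(self, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        # Q-table keyed by (local history, discretised sufficient statistic).
        self.q = defaultdict(lambda: [0.0] * n_actions)

    @staticmethod
    def _key(local_history, statistic):
        # 'statistic' maps joint action-observation histories to probabilities;
        # round the probabilities so the pair can serve as a dictionary key.
        rounded = tuple(sorted(((h, round(p, 2)) for h, p in statistic.items()),
                               key=lambda hp: str(hp[0])))
        return (tuple(local_history), rounded)

    def act(self, local_history, statistic):
        # Epsilon-greedy action selection on the conditioned Q-values.
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        values = self.q[self._key(local_history, statistic)]
        return max(range(self.n_actions), key=values.__getitem__)

    def update(self, local_history, statistic, action, reward,
               next_history, next_statistic, done):
        # Standard one-step Q-learning update, applied independently per agent.
        key = self._key(local_history, statistic)
        nxt = self._key(next_history, next_statistic)
        target = reward if done else reward + self.gamma * max(self.q[nxt])
        self.q[key][action] += self.alpha * (target - self.q[key][action])
```

In a setting such as Dec-Tiger, each agent would hold one such learner, and the sufficient statistic would be updated between stages from the agents' joint decision rules; how that update is performed, and how the earlier agents' decision rules are made available to later agents, is precisely what the thesis investigates and is not reproduced in this sketch.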