DMQL: Deep Maximum Q-Learning

Combatting Relative Overgeneralisation in Deep Independent Learners using Optimism and Similarity

Abstract

Various pathologies can occur when independent learners are used in cooperative Multi-Agent Reinforcement Learning. One such pathology is relative overgeneralisation, which manifests when the learners come to prefer a suboptimal Nash equilibrium in the joint action space of a problem over an optimal equilibrium. Approaches exist to combat relative overgeneralisation in Q-Learning problems, yet many of them do not scale well with the state space or joint action space, are hard to adapt or configure, or are not applicable in partially observable environments.
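
To make the pathology concrete, the sketch below uses a climbing-game-style two-agent matrix game; the payoffs are our own illustrative choice and are not taken from this work. Under a uniformly exploring partner, an average-based independent learner ends up preferring the "safe" third action over the action that participates in the optimal joint action.

```python
import numpy as np

# Illustrative climbing-game-style payoff matrix (our example, not from this
# work). Rows are agent 1's actions, columns are agent 2's actions.
payoff = np.array([
    [ 11, -30,   0],
    [-30,   7,   6],
    [  0,   0,   5],
])

# An average-based independent learner facing a uniformly exploring partner
# effectively scores each of its own actions by the corresponding row mean.
row_means = payoff.mean(axis=1)
print(np.round(row_means, 2))   # [-6.33 -5.67  1.67]
print(int(row_means.argmax()))  # 2: the "safe" action wins, even though the
                                # optimal joint action (0, 0) pays 11 --
                                # relative overgeneralisation in a nutshell.
```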

In this work, we introduce Deep Maximum Q-Learning (DMQL), a methodology that combines Deep Recurrent Q-Networks [Hausknecht & Stone, 2015] with the optimistic assumption found in Distributed Q-Learning [Lauer & Riedmiller, 2000]. DMQL is a maximum-based learning technique that can be scheduled to transition into an average-based learner (or any other type of learner) and that works with independent learners without communication. DMQL is designed to be relatively intuitive and easy to adapt and configure, and it can utilise notions of similarity to provide solutions in large and continuous state spaces.
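
As a rough illustration of the optimistic assumption that DMQL builds on, the tabular sketch below shows a maximum-based update in the spirit of Distributed Q-Learning, in which the value estimate is only ever raised; the function name and signature are our own and do not appear in this work.

```python
from collections import defaultdict

def optimistic_update(Q, s, a, r, s_next, actions, gamma=0.99):
    """Maximum-based tabular update in the spirit of Distributed Q-Learning
    [Lauer & Riedmiller, 2000]: the estimate can only increase, so low returns
    caused by exploring teammates cannot drag it down (illustrative sketch)."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = max(Q[(s, a)], target)
    return Q[(s, a)]

# Minimal usage example with a default-initialised Q-table.
Q = defaultdict(float)
optimistic_update(Q, s=0, a=1, r=5.0, s_next=1, actions=[0, 1], gamma=0.9)
```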

DMQL clusters similar histories by mapping them to the same hash based on a subset of the information they contain, such as the current observation, or on other available related information, such as state information. Using these hashes, DMQL constructs a hash-action pseudo-maximum Q-value estimation dictionary that is updated at every gradient update step. A dictionary value degradation technique ensures stability: values are decayed after they have been encountered, which prevents overestimations from being retained in the dictionary. This way, optimism is introduced and relative overgeneralisation is prevented without using true maxima of past Q-value estimates, as these are not guaranteed to be indicative of the real optimal Q-values. In contrast to similar deep learning methodologies [Palmer et al., 2017], DMQL augments Deep Q-Network targets by replacing values rather than discarding them, potentially leading to improved efficiency. In addition, DMQL can be adapted to serve as a maximisation-based step within the broader learning process of other deep learning algorithms.
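
One plausible reading of this mechanism is sketched below; the class name, the coarsened-observation hashing, and the decay-towards-the-current-estimate rule are our own assumptions and only indicate the general shape of such a hash-action pseudo-maximum dictionary, not the exact formulation used in this work.

```python
class PseudoMaxTable:
    """Illustrative hash-action pseudo-maximum Q-value dictionary (assumed shape)."""

    def __init__(self, decay=0.99):
        self.decay = decay
        # Maps (history_hash, action) to a pseudo-maximum Q-value estimate.
        self.table = {}

    def history_hash(self, observation):
        # Cluster similar histories by hashing a coarsened view of the current
        # observation (or of any other available information source).
        return hash(tuple(round(float(x), 1) for x in observation))

    def update(self, observation, action, q_estimate):
        # Called at every gradient update step: the stored value decays towards
        # the fresh estimate so stale overestimations fade, but it never drops
        # below the fresh estimate itself.
        key = (self.history_hash(observation), action)
        if key not in self.table:
            self.table[key] = q_estimate
        decayed = self.decay * self.table[key] + (1.0 - self.decay) * q_estimate
        self.table[key] = max(decayed, q_estimate)
        return self.table[key]

    def augment_target(self, observation, action, dqn_target):
        # Value replacement rather than value discarding: a pessimistic DQN
        # target is replaced by the stored pseudo-maximum when the latter is
        # larger, instead of dropping the transition altogether.
        key = (self.history_hash(observation), action)
        return max(dqn_target, self.table.get(key, dqn_target))
```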

Our experimental results indicate that DMQL is a successful extension of Distributed Q-Learning that can be used in small environments even without similarity. Using similarity, however, allows us to learn in increasingly large and complex environments. Interestingly, several problems arise when developing a suitable manner of incorporating similarity into the hashes. We speculate on how these problems can be prevented or circumvented, and our experiments validate our circumvention methods. Lastly, our experiments show that DMQL can also successfully be applied to combat relative overgeneralisation in partially observable environments.