Deep Q-Network Memory Sharing

Inter-Agent Prioritised Experience Replay


Abstract

Humans teach one another by recollecting their own experiences and sharing them with others, so that the person being taught does not need to experience those things first-hand in order to learn from them. A large portion of human learning is derived from this concept in some form, and it is the inspiration for this report.
Recent developments in Deep Q-Networks introduced a so-called replay memory: a buffer that stores experiences for the agent to learn from later. In this work, that replay memory is also used as a source of information for other agents. The core principle is that identifying important entries in the replay memory and sharing those memories with other agents could improve an agent's learning process. This selection is done with prioritised sampling, where the importance of a memory determines how likely it is to be sampled for teaching another agent. The desired result is an agent that learns faster.
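
To illustrate the idea, the sketch below shows a replay buffer extended with priority-weighted sampling for sharing entries between agents. The class name SharedReplayBuffer, its methods, and the parameters (capacity, alpha) are illustrative assumptions, not the report's implementation.

```python
import numpy as np


class SharedReplayBuffer:
    """Hypothetical sketch: a replay memory whose most important entries
    can be sampled and shared with other agents."""

    def __init__(self, capacity=10_000, alpha=0.6):
        self.capacity = capacity
        self.transitions = []   # (state, action, reward, next_state, done) tuples
        self.priorities = []    # one importance value per stored transition
        self.alpha = alpha      # how strongly priority shapes the sampling

    def add(self, transition, priority):
        """Store a transition together with its importance value."""
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append((abs(priority) + 1e-6) ** self.alpha)

    def sample_for_sharing(self, k):
        """Draw k transitions, weighted by priority, to send to another agent."""
        probs = np.asarray(self.priorities) / np.sum(self.priorities)
        idx = np.random.choice(len(self.transitions), size=k, p=probs)
        return [self.transitions[i] for i in idx]

    def receive(self, shared_transitions, shared_priorities):
        """Absorb transitions shared by a peer into the local buffer."""
        for t, p in zip(shared_transitions, shared_priorities):
            self.add(t, p)
```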
The temporal-difference error is chosen as the importance metric, as it represents how surprising an experience was to the agent. This method is compared against a baseline DQN agent that does not communicate, a DQN agent that shares randomly selected memories, and several other variations. The benchmark tasks are simple gridworlds with objectives of varying difficulty.
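
To make the importance metric concrete, the following sketch computes the one-step TD error from an agent's Q-value estimates; the function name and signature are assumptions for illustration, not the report's code. Its absolute value could serve as the priority passed to a buffer such as the SharedReplayBuffer sketched above.

```python
import numpy as np


def td_error(q_values, next_q_values, action, reward, gamma=0.99, done=False):
    """One-step temporal-difference error for a single transition.

    A large absolute TD error means the outcome surprised the agent,
    which here acts as the importance signal for prioritised sharing.
    """
    target = reward + (0.0 if done else gamma * float(np.max(next_q_values)))
    return target - float(q_values[action])
```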
The experiments show that prioritisation initially produces rapid learning, because the agents quickly converge to near-identical policies without an immediate loss of reward. This convergence, however, ultimately hampers the learning of the prioritised-messaging agents: in benchmark tasks that benefit from heterogeneous policies, agents with very similar policies are unable to find their individually optimal behaviour. The non-communicating agents were able to diverge in their policies and therefore ultimately solved the tasks more efficiently and with a higher return in rewards.