Deep Q-Network Memory Sharing

Inter-Agent Prioritised Experience Replay


Abstract

Humans teach one another by recollecting their own experiences and sharing them with others, so that the person being taught does not need to experience those things first-hand in order to learn from them. A large portion of human learning is derived from this concept in some form, and it is the inspiration for this report.
Recent developments in Deep Q-Networks introduced a so-called replay memory: a buffer that stores experiences for the agent to learn from later. In this work, that replay memory is also used as a source of information for other agents. The core principle is that identifying important entries in the replay memory and sharing those memories with other agents could improve an agent's learning process. This selection is done with prioritised sampling, where the importance of a memory determines how likely it is to be sampled for teaching another agent. The desired result is an agent that learns faster.
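
To illustrate the idea, the sketch below shows a replay buffer extended with priority-weighted sampling for sharing entries between agents. The class name SharedReplayBuffer, its methods, and the parameters (capacity, alpha) are illustrative assumptions, not the report's implementation.

```python
import numpy as np


class SharedReplayBuffer:
    """Hypothetical sketch: a replay memory whose most important entries
    can be sampled and shared with other agents."""

    def __init__(self, capacity=10_000, alpha=0.6):
        self.capacity = capacity
        self.transitions = []   # (state, action, reward, next_state, done) tuples
        self.priorities = []    # one importance value per stored transition
        self.alpha = alpha      # how strongly priority shapes the sampling

    def add(self, transition, priority):
        """Store a transition together with its importance value."""
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append((abs(priority) + 1e-6) ** self.alpha)

    def sample_for_sharing(self, k):
        """Draw k transitions, weighted by priority, to send to another agent."""
        probs = np.asarray(self.priorities) / np.sum(self.priorities)
        idx = np.random.choice(len(self.transitions), size=k, p=probs)
        return [self.transitions[i] for i in idx]

    def receive(self, shared_transitions, shared_priorities):
        """Absorb transitions shared by a peer into the local buffer."""
        for t, p in zip(shared_transitions, shared_priorities):
            self.add(t, p)
```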
The temporal-difference error is chosen as the importance metric, as it represents how surprising an experience was to the agent. This method is compared against a baseline DQN agent that does not communicate, a DQN agent that shares randomly selected memories, and several other variations. The benchmark tasks are simple gridworlds with objectives of varying difficulty.
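
To make the importance metric concrete, the following sketch computes the one-step TD error from an agent's Q-value estimates; the function name and signature are assumptions for illustration, not the report's code. Its absolute value could serve as the priority passed to a buffer such as the SharedReplayBuffer sketched above.

```python
import numpy as np


def td_error(q_values, next_q_values, action, reward, gamma=0.99, done=False):
    """One-step temporal-difference error for a single transition.

    A large absolute TD error means the outcome surprised the agent,
    which here acts as the importance signal for prioritised sharing.
    """
    target = reward + (0.0 if done else gamma * float(np.max(next_q_values)))
    return target - float(q_values[action])
```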
The experiments show that prioritisation initially produces rapid learning, because the agents quickly converge to near-identical policies without an immediate loss of reward. This convergence, however, ultimately hampers the learning of the prioritised-messaging agents: in benchmark tasks that benefit from heterogeneous policies, agents with very similar policies are unable to find their individually optimal behaviour. The non-communicating agents were able to diverge in their policies and therefore ultimately solved the tasks more efficiently and with a higher return in rewards.