Interval Q-Learning: Balancing Deep and Wide Exploration

Abstract

Reinforcement learning requires exploration, leading to repeated execution of sub-optimal actions. Naive exploration techniques address this problem by gradually shifting from exploration to exploitation. This approach employs a wide search, resulting in exhaustive exploration and low sample efficiency. More advanced search methods explore optimistically based on an upper-bound estimate of expected rewards. These methods employ deep search, aiming to reach states not previously visited. Another deep search strategy is found in action-elimination methods, which aim to discover and eliminate sub-optimal actions. Despite the effectiveness of advanced deep search strategies, some problems are better suited to naive exploration. We devise a new method, called Interval Q-Learning, that finds a balance between wide and deep search. It assigns a small probability to taking sub-optimal actions and combines both greedy and optimistic exploration. This allows for fast convergence to a near-optimal policy, followed by exploration around it. We demonstrate the performance of tabular and deep Q-network versions of Interval Q-Learning, showing that it offers convergence speed-up both in problems that favor wide exploration methods and in those that favor deep search strategies.
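
To make the exploration rule described above concrete, the sketch below shows one way a tabular agent could mix greedy and optimistic action selection: it mostly acts greedily on its current Q estimate, but with a small probability follows an optimistic upper-bound estimate instead. This is a minimal illustration under stated assumptions, not the authors' exact algorithm; the class name `IntervalQAgentSketch`, the upper-bound update rule, and all hyperparameters are placeholders chosen for clarity.

```python
import numpy as np

class IntervalQAgentSketch:
    """Illustrative sketch: keeps a greedy Q estimate and an optimistic
    upper-bound estimate, and mixes greedy with optimistic action selection."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99,
                 explore_prob=0.05, optimism=1.0):
        self.q = np.zeros((n_states, n_actions))                  # greedy estimate
        self.q_upper = np.full((n_states, n_actions), optimism)   # optimistic upper bound
        self.alpha = alpha
        self.gamma = gamma
        self.explore_prob = explore_prob  # small probability of optimistic exploration

    def act(self, state, rng):
        # Mostly exploit the greedy estimate; occasionally follow the upper bound,
        # which directs exploration toward potentially under-explored actions.
        if rng.random() < self.explore_prob:
            return int(np.argmax(self.q_upper[state]))
        return int(np.argmax(self.q[state]))

    def update(self, s, a, r, s_next, done):
        # Standard Q-learning update for the greedy estimate.
        target = r if done else r + self.gamma * np.max(self.q[s_next])
        self.q[s, a] += self.alpha * (target - self.q[s, a])
        # Assumed rule: relax the optimistic bound toward its own Bellman target,
        # while never letting it drop below the greedy estimate.
        upper_target = r if done else r + self.gamma * np.max(self.q_upper[s_next])
        self.q_upper[s, a] += self.alpha * (upper_target - self.q_upper[s, a])
        self.q_upper[s, a] = max(self.q_upper[s, a], self.q[s, a])

# Example usage with hypothetical environment dimensions:
# agent = IntervalQAgentSketch(n_states=16, n_actions=4)
# rng = np.random.default_rng(0)
# action = agent.act(state=0, rng=rng)
```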