Using NoisyNet to Improve Exploration in Contextual Bandit Settings

Bachelor Thesis (2025)
Author(s)

S. Ruff (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

P.R. van der Vaart – Mentor (TU Delft - Sequential Decision Making)

Neil Yorke-Smith – Mentor (TU Delft - Algorithmics)

Matthijs T. J. Spaan – Graduation committee member (TU Delft - Sequential Decision Making)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
27-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Efficient exploration is a major challenge in reinforcement learning, particularly in environments with sparse rewards, where traditional methods such as ε-greedy fail to reach an optimal policy efficiently. NoisyNet, a method proposed by Fortunato et al., showed promise by improving efficiency on RL tasks such as Atari games, driving exploration with learned perturbations of the network weights. To test the robustness of a contextual bandit agent built on this method, three types of settings were investigated: (1) ContextualBandit-v2, a bandit with multiple predefined functions mapping a one-dimensional continuous input to the reward; (2) MNISTBandit-v0, a bandit rewarding correct identification of MNIST dataset images; and (3) NNBandit-v0, a bandit whose reward is determined by a neural network. Furthermore, non-stationary variants of environments (1) and (3) were tested. Hyperparameter sensitivity varied slightly between environments, and a generally optimal set was determined. Overall, NoisyNet-DQNs (Deep Q-Networks) achieved performance comparable to regular DQNs, though often slightly lower. In the high-dimensional stationary MNISTBandit-v0 environment, NoisyNet-DQN converged to an optimal policy slightly faster, at the cost of a larger variation in performance.
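To make the exploration mechanism concrete, below is a minimal sketch of a noisy linear layer with factorized Gaussian noise in the style of Fortunato et al. It is illustrative only, not the implementation used in the thesis: the class name NoisyLinear, the parameter sigma_init, and the PyTorch framing are all assumptions introduced here for the example.

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer with learned factorized Gaussian noise (after Fortunato et al.).

    Effective weights are w = mu_w + sigma_w * eps_w, where eps_w is resampled
    noise and sigma_w is learned, so the network can scale its own exploration.
    Hypothetical sketch; names and defaults are not from the thesis.
    """

    def __init__(self, in_features: int, out_features: int, sigma_init: float = 0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.sigma_init = sigma_init
        # Learnable means and noise scales
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise buffers: resampled, never trained
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        self.reset_parameters()
        self.reset_noise()

    def reset_parameters(self):
        bound = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(self.sigma_init / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(self.sigma_init / math.sqrt(self.in_features))

    @staticmethod
    def _scaled_noise(size: int) -> torch.Tensor:
        # f(x) = sgn(x) * sqrt(|x|), as in the factorized-noise formulation
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # Factorized noise: one vector per input unit, one per output unit
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_eps.copy_(eps_out.outer(eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:  # act on the mean weights at evaluation time
            weight, bias = self.weight_mu, self.bias_mu
        return nn.functional.linear(x, weight, bias)

if __name__ == "__main__":
    layer = NoisyLinear(4, 2)
    layer.reset_noise()  # resample exploration noise, e.g. once per step
    print(layer(torch.randn(1, 4)))
```

In a NoisyNet-DQN, such layers typically replace the fully connected layers of the Q-network, and ε-greedy action selection is dropped in favour of acting greedily with respect to the noisy Q-values; the noise scales σ are learned by gradient descent along with the rest of the network.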

Files

CSE3000_Paper_v4.pdf
(PDF | 7.91 MB)