Using NoisyNet to Improve Exploration in Contextual Bandit Settings

Bachelor Thesis (2025)
Author(s)

S. Ruff (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

P.R. van der Vaart – Mentor (TU Delft - Sequential Decision Making)

Neil Yorke-Smith – Mentor (TU Delft - Algorithmics)

Matthijs T. J. Spaan – Graduation committee member (TU Delft - Sequential Decision Making)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
27-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Efficient exploration is a major challenge in reinforcement learning, particularly in environments with sparse rewards, where traditional methods such as ε-greedy fail to reach an optimal policy efficiently. NoisyNet, a method proposed by Fortunato et al., showed promise by improving efficiency on RL tasks such as Atari games, driving exploration with learned perturbations of the network weights. To test the robustness of a contextual bandit agent built on this method, three types of settings were investigated: (1) ContextualBandit-v2, a bandit with multiple predefined functions mapping a one-dimensional continuous input to the reward; (2) MNISTBandit-v0, a bandit rewarding correct identification of MNIST dataset images; and (3) NNBandit-v0, a bandit whose reward is determined by a neural network. Furthermore, non-stationary variants of environments (1) and (3) were tested. Hyperparameter sensitivity varied slightly between environments, and a generally optimal set was determined. Overall, NoisyNet-DQNs (Deep Q-Networks) achieved performance comparable to regular DQNs, though often slightly lower. In the high-dimensional stationary MNISTBandit-v0 environment, NoisyNet-DQN converged to an optimal policy slightly faster, at the cost of a larger variation in performance.
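To make the exploration mechanism concrete, below is a minimal sketch of a noisy linear layer with factorized Gaussian noise in the style of Fortunato et al. It is illustrative only, not the implementation used in the thesis: the class name NoisyLinear, the parameter sigma_init, and the PyTorch framing are all assumptions introduced here for the example.

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer with learned factorized Gaussian noise (after Fortunato et al.).

    Effective weights are w = mu_w + sigma_w * eps_w, where eps_w is resampled
    noise and sigma_w is learned, so the network can scale its own exploration.
    Hypothetical sketch; names and defaults are not from the thesis.
    """

    def __init__(self, in_features: int, out_features: int, sigma_init: float = 0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.sigma_init = sigma_init
        # Learnable means and noise scales
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise buffers: resampled, never trained
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        self.reset_parameters()
        self.reset_noise()

    def reset_parameters(self):
        bound = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(self.sigma_init / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(self.sigma_init / math.sqrt(self.in_features))

    @staticmethod
    def _scaled_noise(size: int) -> torch.Tensor:
        # f(x) = sgn(x) * sqrt(|x|), as in the factorized-noise formulation
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # Factorized noise: one vector per input unit, one per output unit
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_eps.copy_(eps_out.outer(eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:  # act on the mean weights at evaluation time
            weight, bias = self.weight_mu, self.bias_mu
        return nn.functional.linear(x, weight, bias)

if __name__ == "__main__":
    layer = NoisyLinear(4, 2)
    layer.reset_noise()  # resample exploration noise, e.g. once per step
    print(layer(torch.randn(1, 4)))
```

In a NoisyNet-DQN, such layers typically replace the fully connected layers of the Q-network, and ε-greedy action selection is dropped in favour of acting greedily with respect to the noisy Q-values; the noise scales σ are learned by gradient descent along with the rest of the network.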

Files

CSE3000_Paper_v4.pdf
(PDF | 7.91 MB)