Efficient exploration is a major issue in reinforcement learning, particularly in environments with sparse rewards, where traditional methods such as ε-greedy fail to reach an optimal policy efficiently. A method proposed by Fortunato et al. showed promise by improving efficiency on RL tasks such as Atari games, driving exploration with learned perturbations of the network weights. To test the robustness of a contextual bandit implemented with this method, three settings were investigated: (1) ContextualBandit-v2, a bandit with multiple predefined functions mapping a one-dimensional continuous input to the reward; (2) MNISTBandit-v0, a bandit rewarding correct identification of MNIST images; and (3) NNBandit-v0, a bandit whose reward is determined by a neural network. Non-stationary variants of environments 1 and 3 were also tested. A slight variation in hyperparameter sensitivity between environments was observed, and a generally optimal set was determined. Overall, NoisyNet-DQNs (Deep Q-Networks) achieved performance comparable to regular DQNs, though often slightly lower. In the high-dimensional stationary MNISTBandit-v0 environment, NoisyNet-DQN converged to an optimal policy slightly faster, at the cost of a larger variation in performance.
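To make the "learned perturbations of the network weights" concrete, the sketch below shows a noisy linear layer in the spirit of Fortunato et al.'s NoisyNet: each weight and bias has a learned mean and a learned noise scale, and a fresh factorised Gaussian perturbation is sampled when the noise is reset. This is a minimal illustration assuming PyTorch; the class name, parameter names, and initialisation constants are illustrative and not taken from the experiments described above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyLinear(nn.Module):
    """Linear layer with learned, factorised Gaussian weight noise
    (an illustrative sketch of the NoisyNet idea, not the report's code)."""

    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Learnable means and noise scales for weights and biases.
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise samples are buffers: not trained, but resampled during learning.
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        self.reset_parameters(sigma_init)
        self.reset_noise()

    def reset_parameters(self, sigma_init):
        bound = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma_init / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(sigma_init / math.sqrt(self.in_features))

    @staticmethod
    def _scaled_noise(size):
        # f(x) = sign(x) * sqrt(|x|), applied to standard Gaussian samples.
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # Factorised noise: one vector per input unit, one per output unit.
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_eps.copy_(torch.outer(eps_out, eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x):
        # Perturbed parameters: mean + learned scale * sampled noise.
        weight = self.weight_mu + self.weight_sigma * self.weight_eps
        bias = self.bias_mu + self.bias_sigma * self.bias_eps
        return F.linear(x, weight, bias)
```

In a NoisyNet-DQN, layers like this would replace the ordinary linear layers of the Q-network, and `reset_noise()` would be called periodically (e.g. before each action selection or learning step) so that exploration is driven by the current, learned noise scales rather than by an external ε-greedy schedule.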