How kernelized Multi-Armed Bandit algorithms compare to other algorithms with fixed kernelized reward and noisy observations
M.K. Herrebout (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Julia Olkhovskaya – Mentor (TU Delft - Sequential Decision Making)
R Venkatesha Prasad – Graduation committee member (TU Delft - Networked Systems)
Abstract
The aim of this paper is to compare several Multi-Armed Bandit algorithms in an environment with a fixed kernelized reward function and noisy observations. Bandit algorithms address a class of sequential decision-making problems in which the goal is to balance exploration and exploitation of the available choices. Each decision yields some reward, and the objective is to minimize the regret incurred by a sequence of decisions: the difference between the reward of the decisions actually made and the reward of the optimal decisions. In particular, these algorithms deal with the trade-off between choosing the best-known option and exploring new, possibly better options. They are widely used in reinforcement learning, optimization, and economics, where decisions must be made under incomplete information and uncertainty. Environments differ, however, and some algorithms perform better in certain environments than others.
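To make the regret notion above concrete, the following is a minimal sketch of a bandit loop with noisy reward observations. It uses an epsilon-greedy player as a simple illustrative baseline (not one of the algorithms evaluated in the paper); the arm means, noise level, and function names are all assumptions chosen for the example.

```python
import random

def cumulative_regret(true_means, choices):
    """Regret of a sequence of decisions: the expected reward of always
    playing the best arm minus the expected reward of the arms chosen."""
    best = max(true_means)
    return sum(best - true_means[arm] for arm in choices)

def epsilon_greedy(true_means, horizon, epsilon=0.1, noise_std=0.1, seed=0):
    """Illustrative epsilon-greedy baseline: with probability epsilon
    explore a random arm, otherwise exploit the best empirical mean.
    Rewards are observed with additive Gaussian noise (an assumption)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    choices = []
    for t in range(horizon):
        if t < k:                      # play each arm once to initialize
            arm = t
        elif rng.random() < epsilon:   # explore
            arm = rng.randrange(k)
        else:                          # exploit best empirical mean
            arm = max(range(k), key=lambda a: sums[a] / counts[a])
        reward = true_means[arm] + rng.gauss(0.0, noise_std)  # noisy observation
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)
    return choices

choices = epsilon_greedy([0.2, 0.5, 0.8], horizon=500)
regret = cumulative_regret([0.2, 0.5, 0.8], choices)
```

A good algorithm keeps the cumulative regret growing sublinearly in the horizon; comparing this growth rate across algorithms is exactly the kind of evaluation the paper performs.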