Evaluating Performance of Bandit Algorithms in Non-stationary Contextual Environments

Abstract

This thesis investigates the performance of various bandit algorithms in non-stationary contextual environments, where reward functions change unpredictably over time. Traditional bandit algorithms, designed for stationary settings, often fail in such dynamic real-world scenarios. This research evaluates the adaptability and computational performance of widely used algorithms, including UCB, LinUCB, and LinEXP3, using a self-implemented bandit framework. Empirical results highlight the trade-offs among these algorithms and practical strategies for applying them under non-stationary conditions. Notably, LinEXP3 demonstrated superior performance in complex environments, owing to its exponential-weighting update designed for adversarially changing rewards, despite its higher computational cost. The key contributions of this thesis are the empirical evaluation of these algorithms and their implementations in tailored non-stationary environment settings. The results suggest promising directions for further research, including a broader range of algorithms such as Contextual Thompson Sampling and other reinforcement learning methods adapted to linear contextual settings. Future work should also validate these algorithms on real-world datasets and introduce covariance structure for the context vectors to simulate more realistic learning processes. These findings could inform the design and implementation of bandit algorithms in practical applications such as recommendation systems and financial portfolio management.
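
The thesis's own framework is not reproduced here; the following is a minimal, self-contained sketch of the kind of setup the abstract describes: a standard (disjoint-model) LinUCB learner interacting with a linear contextual environment whose hidden reward parameters drift over time. The class names, drift model, and all parameter values are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-stationary linear contextual environment: rewards are
# linear in the context, but each arm's hidden parameter vector drifts,
# so a stationary learner's estimates gradually go stale.
class DriftingLinearBandit:
    def __init__(self, n_arms=5, dim=4, drift=0.05, noise=0.1):
        self.n_arms, self.dim = n_arms, dim
        self.drift, self.noise = drift, noise
        self.theta = rng.normal(size=(n_arms, dim))  # hidden reward parameters

    def contexts(self):
        # One unit-norm context vector per arm, drawn fresh each round.
        x = rng.normal(size=(self.n_arms, self.dim))
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    def pull(self, arm, x):
        reward = x[arm] @ self.theta[arm] + rng.normal(scale=self.noise)
        # Random-walk drift makes the problem non-stationary.
        self.theta += self.drift * rng.normal(size=self.theta.shape)
        return reward

    def best_mean_reward(self, x):
        return max(x[a] @ self.theta[a] for a in range(self.n_arms))

# Standard LinUCB with one ridge-regression model per arm.
class LinUCB:
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted context sums

    def select(self, x):
        scores = []
        for a, ctx in enumerate(x):
            A_inv = np.linalg.inv(self.A[a])
            theta_hat = A_inv @ self.b[a]
            bonus = self.alpha * np.sqrt(ctx @ A_inv @ ctx)  # optimism bonus
            scores.append(ctx @ theta_hat + bonus)
        return int(np.argmax(scores))

    def update(self, arm, ctx, reward):
        self.A[arm] += np.outer(ctx, ctx)
        self.b[arm] += reward * ctx

env = DriftingLinearBandit()
agent = LinUCB(n_arms=5, dim=4)
regret = 0.0
for t in range(2000):
    x = env.contexts()
    arm = agent.select(x)
    # Pseudo-regret against the best arm under the current (pre-drift) parameters.
    regret += env.best_mean_reward(x) - x[arm] @ env.theta[arm]
    agent.update(arm, x[arm], env.pull(arm, x))
print(f"cumulative pseudo-regret after 2000 rounds: {regret:.1f}")
```

Because LinUCB never discounts old observations, its regret keeps growing as the parameters drift; this is the kind of failure mode under non-stationarity that motivates comparing it against adversarial methods such as LinEXP3.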