Exploring Bandit Algorithms in User-Interactive Systems

Influence of Delay on Contextual Multi-Armed Bandits

Bachelor thesis (2024)

Authors

D.C. Arsene Electrical Engineering, Mathematics and Computer Science

Contributors

Julia Olkhovskaya Sequential Decision Making - (mentor)

Ranga Rao Venkatesha Prasad Networked Systems - (graduation committee member)

Faculty

Electrical Engineering, Mathematics and Computer Science

Contextual Delay Muli-armed bandit Time window size

More Info

expand_more

To reference this document use:

http://resolver.tudelft.nl/uuid:afc4d526-2432-4b60-9cf6-287a1eeed7a2

Published Date

25-06-2024

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Faculty

Electrical Engineering, Mathematics and Computer Science

Abstract

Delay is a frequently encountered phenomenon in Multi-armed bandit problems that affects the accuracy of choosing the optimal arm. One example of this phenomenon is online shopping, where there is a delay between a user being recommended a product and placing the order. This study investigates the influence of delay in contextual multi-armed bandit settings, focusing on the performance of the OTFLinUCB algorithm, an adaptation of the Lin- UCB algorithm designed to handle delayed feed- back. The cumulative regret associated with various bandit algorithms, including UCB, Exp3, Lin- UCB, and OTFLinUCB, is examined under different delayed environments. Experiments con- ducted using artificial data generated to simulate real-world scenarios reveal that while delay cannot be entirely mitigated, its impact can be minimized through the strategic selection of hyperparameters, particularly the time window size for OTFLinUCB. The findings indicate that OTFLinUCB’s performance is highly dependent on the chosen window size, which balances the trade-off between estimator accuracy and memory consumption. A method for determining the optimal window size is proposed by introducing an intermediary distribution- agnostic term, conversion rate. Empirical evidence suggests that maintaining a conversion rate of around 70% achieves a balance between minimizing cumulative regret and optimizing memory usage. As a result, this research contributes to the understanding of the delays’ effect on bandit algorithms in contextual environments and offers practical guidelines for tuning OTFLinUCB’s time window.

Files

Final_RP_Paper_Dragos.pdf

(.pdf | 1.32 Mb)